Start¶

Import the necessary libraries

Import the libraries needed to get started; a few more will be imported further ahead.

In [228]:
import pandas as pd
import calendar
import seaborn as sns
import time
import tensorflow as tf
import numpy as np
import scipy.stats as stats
import dataframe_image as dfi

from matplotlib import pyplot as plt
%matplotlib inline
from math import sqrt

print('Libraries have been imported !')
Libraries have been imported !

Dataset¶

Open the Excel file, read the sheet named '6. Precio OIC Mensual', drop the header rows, and rename the columns

In [229]:
# to use the URL
file = 'https://federaciondecafeteros.org/app/uploads/2020/01/Precios-%C3%A1rea-y-producci%C3%B3n-de-caf%C3%A9.xlsx'

# to use the local file
#file = pd.ExcelFile('Precios-área-y-producción-de-café.xlsx')

# name of the sheet '6. Precio OIC Mensual'
df = pd.read_excel(file, sheet_name = '6. Precio OIC Mensual')

df.head()
Out[229]:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15
0 NaN NaN Precios indicativos OIC por grupos - Promedio ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN Centavos de dólar por libra NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN Fuente: ICO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN Suaves colombianos (arábigo) NaN NaN Otros suaves (arábigo) NaN NaN Naturales del Brasil (arábigo) NaN NaN Robustas NaN NaN NaN
In [230]:
# drops the first and last columns, as neither holds any data
df = df.drop(df.columns[[0,-1]], axis = 1)

# drops the first 6 rows, which are part of the formatting of the Excel file
df = df.drop(index = df.index[0:6],
       axis = 0)

# because the first rows were deleted, reset the index so it starts at 0 again
df = df.reset_index(drop=True)
In [231]:
# renames the columns for easier comprehension
df = df.rename(columns = {'Unnamed: 1': 'Date', 'Unnamed: 2':'OIC_price',
                          'Unnamed: 3':'Colombia_ny', 'Unnamed: 4':'Colombia_europe', 'Unnamed: 5':'Colombia_average',
                          'Unnamed: 6':'Other_ny', 'Unnamed: 7':'Other_europe','Unnamed: 8':'Other_average',
                          'Unnamed: 9':'Brazil_ny', 'Unnamed: 10':'Brazil_europe', 'Unnamed: 11':'Brazil_average',
                          'Unnamed: 12':'Robustas_ny', 'Unnamed: 13':'Robustas_europe','Unnamed: 14':'Robustas_average'})
df.head()
Out[231]:
Date OIC_price Colombia_ny Colombia_europe Colombia_average Other_ny Other_europe Other_average Brazil_ny Brazil_europe Brazil_average Robustas_ny Robustas_europe Robustas_average
0 2000-01-01 00:00:00 82.15 130.12 124.36 130.13 109.17 116.82 111.11 97.67 103.1 97.68 53.62 52.41 53.18
1 2000-02-01 00:00:00 76.15 124.72 118.67 124.73 101.17 110.19 103.44 91.51 96.58 91.51 49.41 47.97 48.85
2 2000-03-01 00:00:00 73.49 119.51 115.78 119.51 98.26 108.13 100.73 89.93 94.78 89.93 47.26 44.73 46.25
3 2000-04-01 00:00:00 69.53 112.67 109.12 112.67 92.41 101.51 94.61 86.46 90.7 86.46 45.21 43.31 44.45
4 2000-05-01 00:00:00 69.22 110.31 107.85 110.31 91.76 100.99 94.17 87.23 91.01 87.23 45.19 43.01 44.32

Exploratory Data Analysis¶

Description of each Column

Date: The month of the observation, monthly from January 2000; every price refers to this date.

OIC_price: The average International Coffee Organization composite price for the month and year shown, in US cents/lb

Colombia_ny: The average price of Colombian Mild Arabicas in the US market for the month and year shown, in US cents/lb

Colombia_europe: The average price of Colombian Mild Arabicas in Germany and France for the month and year shown, in US cents/lb

Colombia_average: The weighted-average Colombian Mild Arabicas Composite Indicator Price, as per Section 4 of the Indicator Prices SC-106/21, for the month and year shown, in US cents/lb

Other_ny: The average price of Other Mild Arabicas in the US market for the month and year shown, in US cents/lb

Other_europe: The average price of Other Mild Arabicas in Germany and France for the month and year shown, in US cents/lb

Other_average: The Other Mild Arabicas Composite Indicator Price, as per Section 4 of the Indicator Prices SC-106/21, for the month and year shown, in US cents/lb

Brazil_ny: The average price of Brazilian Naturals in the US market for the month and year shown, in US cents/lb

Brazil_europe: The average price of Brazilian Naturals in Germany and France for the month and year shown, in US cents/lb

Brazil_average: The Brazilian Naturals Composite Indicator Price, as per Section 4 of the Indicator Prices SC-106/21, for the month and year shown, in US cents/lb

Robustas_ny: The average price of Robustas in the US market for the month and year shown, in US cents/lb

Robustas_europe: The average price of Robustas in Germany and France for the month and year shown, in US cents/lb

Robustas_average: The Robustas Composite Indicator Price, as per Section 4 of the Indicator Prices SC-106/21, for the month and year shown, in US cents/lb

[Indicator Prices SC-106/21](https://www.ico.org/documents/cy2020-21/sc-106e-rules-indicator-prices.pdf)

Checking how many NA's there are in the dataset, the type of each column, the number of columns, and the general structure of the data

In [232]:
# create a copy of the data frame to modify it and to keep the original intact
eda = df.copy()

# checking that both dataframes are different objects in memory
print(f' Memory for df: {id(df)} ----- Memory for eda: {id(eda)}')
 Memory for df: 2813154014784 ----- Memory for eda: 2813120644864
In [233]:
# exploring the data types, the number of rows and whether there are NA's
eda.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272 entries, 0 to 271
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Date              272 non-null    object
 1   OIC_price         272 non-null    object
 2   Colombia_ny       272 non-null    object
 3   Colombia_europe   272 non-null    object
 4   Colombia_average  272 non-null    object
 5   Other_ny          272 non-null    object
 6   Other_europe      272 non-null    object
 7   Other_average     272 non-null    object
 8   Brazil_ny         272 non-null    object
 9   Brazil_europe     272 non-null    object
 10  Brazil_average    272 non-null    object
 11  Robustas_ny       272 non-null    object
 12  Robustas_europe   272 non-null    object
 13  Robustas_average  272 non-null    object
dtypes: object(14)
memory usage: 29.9+ KB

There are a total of 14 columns, each with 272 rows. Every column has dtype object (essentially strings), so the price columns will have to be converted to float and the Date column to a datetime type.
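The object-to-float conversion mentioned above can also be done with `pd.to_numeric`; a minimal sketch with a hypothetical two-row frame mirroring the raw sheet's dtypes:

```python
import pandas as pd

# Toy frame with object-typed columns, mirroring the raw sheet's dtypes
raw = pd.DataFrame({"OIC_price": ["82.15", "76.15"],
                    "Colombia_ny": ["130.12", "124.72"]}, dtype=object)

# errors='coerce' turns any stray text (e.g. leftover header fragments)
# into NaN instead of raising, so the cast never fails mid-column
clean = raw.apply(pd.to_numeric, errors="coerce")

print(clean.dtypes.tolist())
```

This is an alternative to the `.astype(float)` loop used later in the notebook; `coerce` is more forgiving when a cell still contains non-numeric residue.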

In [234]:
# checking if there are NA's
eda.isna().sum()
Out[234]:
Date                0
OIC_price           0
Colombia_ny         0
Colombia_europe     0
Colombia_average    0
Other_ny            0
Other_europe        0
Other_average       0
Brazil_ny           0
Brazil_europe       0
Brazil_average      0
Robustas_ny         0
Robustas_europe     0
Robustas_average    0
dtype: int64

There are 272 rows

The data does not have NA’s or empty cells

Duplicates in each column

Colombia_europe [20 - 40]: in 2001-09 and 2003-05 the coffee price was exactly the same

Brazil_ny [75 - 89]: in 2006-04 and 2007-06 the coffee price was exactly the same

Other_ny [158 - 159]: in 2013-03 and 2013-04 the coffee price was exactly the same

Other_europe [29 - 41]: in 2002-06 and 2003-06 the coffee price was exactly the same

Robustas_europe [118 - 119 - 130 - 183]: in 2010-11 and 2020-04 the coffee price was exactly the same

Robustas_ny [35 - 39]: in 2002-12 and 2003-04 the coffee price was exactly the same

In [235]:
# checking if there are duplicated values 
dup_OIC = eda[eda.duplicated(['OIC_price'],keep = False)]
dup_Col_ny = eda[eda.duplicated(['Colombia_ny'],keep = False)]
dup_Col_europe = eda[eda.duplicated(['Colombia_europe'],keep = False)]
dup_Col_average = eda[eda.duplicated(['Colombia_average'],keep = False)]
dup_Other_ny= eda[eda.duplicated(['Other_ny'],keep = False)]
dup_Other_europe= eda[eda.duplicated(['Other_europe'],keep = False)]
dup_Other_average= eda[eda.duplicated(['Other_average'],keep = False)]
dup_Brazil_ny= eda[eda.duplicated(['Brazil_ny'],keep = False)]
dup_Brazil_europe= eda[eda.duplicated(['Brazil_europe'],keep = False)]
dup_Brazil_average= eda[eda.duplicated(['Brazil_average'],keep = False)]
dup_Robustas_ny= eda[eda.duplicated(['Robustas_ny'],keep = False)]
dup_Robustas_europe= eda[eda.duplicated(['Robustas_europe'],keep = False)]
dup_Robustas_average= eda[eda.duplicated(['Robustas_average'],keep = False)]
#print(dup_Robustas_europe)

The columns have dtype object; converting them to numeric float so the descriptive statistics can be computed

In [236]:
# converts the column 'Date' to a datetime type
eda['Date'] = pd.to_datetime(eda['Date'], format = '%d%m%Y')

# list with every column name except 'Date'
lista = list(eda.columns)
lista.pop(0)

# converts each price column to float
for item in lista:
    eda[item] = eda[item].astype(float)
    
#checking descriptive statistics
eda.describe()

# saves the table as a png or svg
#df_styled = eda.describe().style.background_gradient()
#dfi.export(df_styled, "table statistics.png")
Out[236]:
OIC_price Colombia_ny Colombia_europe Colombia_average Other_ny Other_europe Other_average Brazil_ny Brazil_europe Brazil_average Robustas_ny Robustas_europe Robustas_average
count 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000 272.000000
mean 113.633237 153.333888 148.409523 151.222932 145.099616 143.098234 143.966330 118.365112 123.408718 121.959193 79.670588 74.456376 75.376692
std 43.687941 62.437782 59.732402 61.156478 58.414410 55.916915 56.927377 49.732689 51.170812 50.984818 28.448193 26.567300 26.767165
min 41.170000 58.920000 57.720000 58.100000 51.950000 55.760000 54.280000 37.670000 38.710000 38.630000 21.250000 22.790000 22.810000
25% 88.547500 112.940000 111.697500 112.670000 108.720000 110.295000 109.712500 94.405000 96.056883 95.605714 57.895000 54.765000 55.347500
50% 113.155682 144.413636 141.123636 143.529552 141.896818 138.104348 140.704773 111.980000 117.983409 116.833333 84.680554 78.150682 79.203636
75% 133.130252 182.577237 178.760455 179.022500 169.232857 165.942045 166.686126 132.726023 143.199599 140.728880 103.530147 97.329432 98.367841
max 231.240000 319.633750 311.450000 312.950000 303.590000 297.220000 300.120000 271.390000 273.430000 273.400000 126.300000 121.300000 121.980000

Note that the maximum of Colombia_average is almost three times that of Robustas_average, which comes predominantly from Vietnam. The reason is that “Robustas has a high caffeine content (2% to 4%), so the flavor is not as pure as Arabica” (Roldan Perez et al., 2009, 32), and the “quality of the Robusta produced is uneven because of processing technology, drying equipment and post-harvest technological problems. These cause the coffee beans to have a high humidity level, and not meet the required standard of color, quality, and so on. This is the reason that Vietnam’s coffee price is lower than the world price.” (Roldan Perez et al., 2009, 32).

The 3rd quartile (75%) of Robustas_average is below the 1st quartile (25%) of Colombia_average and Other_average, and almost equal to the 1st quartile (25%) of Brazil_average.
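The quartiles compared above come straight out of `Series.quantile`; a minimal sketch with hypothetical stand-in values for one price column:

```python
import pandas as pd

# Hypothetical stand-in values for one price column
s = pd.Series([55.0, 75.0, 88.0, 98.0, 122.0])

# Q1 and Q3, the same quartiles compared in the text
q1, q3 = s.quantile(0.25), s.quantile(0.75)
print(q1, q3)
```

With five evenly indexed points, pandas' default linear interpolation lands Q1 on the second value and Q3 on the fourth.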

The correlation between the attributes

In [237]:
# to check the correlation
eda.corr()
Out[237]:
OIC_price Colombia_ny Colombia_europe Colombia_average Other_ny Other_europe Other_average Brazil_ny Brazil_europe Brazil_average Robustas_ny Robustas_europe Robustas_average
OIC_price 1.000000 0.961247 0.974154 0.968370 0.983946 0.987328 0.986648 0.981872 0.988744 0.989057 0.902821 0.883472 0.888010
Colombia_ny 0.961247 1.000000 0.989911 0.997974 0.978964 0.978009 0.977207 0.953280 0.950817 0.952474 0.795684 0.761376 0.768793
Colombia_europe 0.974154 0.989911 1.000000 0.996692 0.979493 0.986822 0.982911 0.962704 0.967117 0.967156 0.810751 0.783200 0.789372
Colombia_average 0.968370 0.997974 0.996692 1.000000 0.981455 0.984258 0.981992 0.959397 0.959980 0.960859 0.802224 0.770656 0.777598
Other_ny 0.983946 0.978964 0.979493 0.981455 1.000000 0.995489 0.996963 0.971176 0.977323 0.977294 0.845858 0.812284 0.819485
Other_europe 0.987328 0.978009 0.986822 0.984258 0.995489 1.000000 0.997574 0.975087 0.984532 0.983430 0.841851 0.813314 0.819732
Other_average 0.986648 0.977207 0.982911 0.981992 0.996963 0.997574 1.000000 0.973195 0.981179 0.980708 0.846334 0.816121 0.822735
Brazil_ny 0.981872 0.953280 0.962704 0.959397 0.971176 0.975087 0.973195 1.000000 0.994153 0.996667 0.838267 0.819359 0.823952
Brazil_europe 0.988744 0.950817 0.967117 0.959980 0.977323 0.984532 0.981179 0.994153 1.000000 0.999492 0.856312 0.837155 0.841898
Brazil_average 0.989057 0.952474 0.967156 0.960859 0.977294 0.983430 0.980708 0.996667 0.999492 1.000000 0.854775 0.835624 0.840293
Robustas_ny 0.902821 0.795684 0.810751 0.802224 0.845858 0.841851 0.846334 0.838267 0.856312 0.854775 1.000000 0.991836 0.994478
Robustas_europe 0.883472 0.761376 0.783200 0.770656 0.812284 0.813314 0.816121 0.819359 0.837155 0.835624 0.991836 1.000000 0.999689
Robustas_average 0.888010 0.768793 0.789372 0.777598 0.819485 0.819732 0.822735 0.823952 0.841898 0.840293 0.994478 0.999689 1.000000

The correlation matrix between all the columns shows a high correlation among all the averages. The Robustas columns correlate somewhat less, but the correlation is still strong: the lowest value in the matrix is between Robustas_europe and Colombia_ny, at 0.76, and against our target column, OIC_price, the lowest correlation is Robustas_europe at 0.88.

Overall, all columns are highly correlated with one another, and each average column has a correlation close to 1 with the market columns of the same coffee group it averages.
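Finding the most (or least) correlated pair by eye gets tedious with 13 columns; a small sketch, with hypothetical columns, that masks the diagonal, flattens the matrix, and ranks every pair:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
# Hypothetical columns: 'a' and 'b' tightly coupled, 'c' independent noise
df = pd.DataFrame({"a": base,
                   "b": base + rng.normal(scale=0.05, size=300),
                   "c": rng.normal(size=300)})

corr = df.corr()
# Mask the diagonal (self-correlations), flatten, and rank every pair
pairs = corr.where(~np.eye(len(corr), dtype=bool)).stack().sort_values()
print(pairs.idxmax())  # most correlated pair
print(pairs.idxmin())  # least correlated pair
```

Applied to `eda.corr()`, the same three lines would pull out the Robustas_europe/Colombia_ny pair mentioned above without scanning the matrix manually.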

In [238]:
# checking the correlation visually on a heat map

plt.figure(figsize = (20,8))
hm = sns.heatmap(eda.corr(), annot = True, vmin = -1, vmax = 1)
hm.set_title('Heatmap Correlation', fontdict = {'fontsize':16});
#plt.savefig('Heat Map.svg')
In [239]:
# Doing a scatter plot matrix just with the average columns
sns.pairplot(eda[['OIC_price', 'Colombia_average', 'Other_average', 'Brazil_average', 'Robustas_average']]);
#plt.savefig('scatter plot.svg')

The scatterplot between all average columns and OIC_price confirms the correlations seen above on the matrix.

In [240]:
# Doing a scatter plot matrix with the all columns
sns.pairplot(eda[['Date', 'OIC_price', 'Colombia_ny', 'Colombia_europe',
       'Colombia_average', 'Other_ny', 'Other_europe', 'Other_average',
       'Brazil_ny', 'Brazil_europe', 'Brazil_average', 'Robustas_ny',
       'Robustas_europe', 'Robustas_average']]);
#plt.savefig('scatter plot.jpg')

There is a linear relationship between almost all columns, consistent with the heat map shown above.

Doing a box plot for each of the columns

In [241]:
# increases the size of the graph
plt.figure(figsize=(30,20))

# increases the size of the font
sns.set(font_scale=2)

# sets the data and the orientation
s = sns.boxplot(data = eda, orient = 'h')

# label in x and the size of it
s.set_xlabel('Price', fontsize = 25)

# label in y and the size of it
s.set_ylabel('Market', fontsize = 25)

# title and its size
s.set_title('Price vs Coffee Market', fontsize = 35)

# show the graph!
plt.show();

The median of OIC_price sits roughly where the lower quartile (Q1) of all the “Colombia” columns, as well as the “Other” columns, begins

The median of the “Robustas” columns is smaller than the lower quartile (Q1) of the OIC_price column

The upper quartile (Q3) of “Robustas” is smaller than the lower quartile (Q1) of all the columns, with the exception of OIC_price and Robustas_ny

Brazil_ny has a small IQR, while all the “Colombia” columns have a large IQR

Robustas coffee has the lowest prices, while Colombian coffee has the highest, and the 25th percentile of the Colombian coffee price is roughly the median of the OIC price

Graphing all columns, as a time series, to compare all of them together

In [242]:
# creates a list with the names of the columns
lista = list(eda.columns)
# removes the first item of the list, which is 'Date'
lista.pop(0)

# increases the size of the font
sns.set(font_scale=1)
# sets the size of the figure
plt.figure(figsize=(25,6))
# gets the current axes
ax = plt.gca()

# creates a loop graphing all the columns vs 'Date'
for column in lista:
    eda.plot(x ='Date', y = column, ax = ax)
    
#plt.savefig('Time vs price all.svg')

The Date column is monthly, spanning more than two decades, and a fixed-length window over it behaves like a queue: it has a front and a rear, and every time a new month is appended at the rear, the month at the front is withdrawn.
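The queue behaviour described above can be sketched with a bounded double-ended queue (`collections.deque` with `maxlen`):

```python
from collections import deque

# A fixed-length window over monthly observations behaves like a queue:
# appending a new month at the rear pushes the oldest month out the front
window = deque(["2000-01", "2000-02", "2000-03"], maxlen=3)
window.append("2000-04")   # "2000-01" is silently evicted from the front
print(list(window))
```

The same idea underlies rolling-window statistics such as `Series.rolling(window=3)`.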

The plot of the averages over time shows that some coffees reached up to 300 US cents/lb twice during the period and had many peaks, while the Robustas showed a flatter behaviour.

Comparing just the averages and the OIC prices

In [243]:
lista_averages = ['OIC_price', 'Colombia_average', 'Other_average', 'Brazil_average', 'Robustas_average']
# sets the size of the figure
plt.figure(figsize=(20,6))
# gets the current axes
ax = plt.gca()

# creates a loop graphing all the columns vs 'Date'
for column in lista_averages:
    eda.plot(x ='Date', y = column, ax = ax)
    
#plt.savefig('Time vs price averages.svg')
In [244]:
#'OIC_price', 'Colombia_average', 'Other_average', 'Brazil_average', 'Robustas_average'
eda_subset = eda.drop(['Date'], axis = 1)
In [245]:
# counts the number of outliers per column using the 1.5*IQR rule
def IQR(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    x = df[(df < (q1 - 1.5*iqr)) | (df > (q3 + 1.5*iqr))]
    print(f'Number of outliers {len(x)}')
    print(f'max outlier {x.max()}, min outlier {x.min()}')

for i in eda_subset:
    print('\n', i)
    IQR(eda_subset[i])
 OIC_price
Number of outliers 13
max outlier 231.24, min outlier 200.11

 Colombia_ny
Number of outliers 16
max outlier 319.63375, min outlier 288.43

 Colombia_europe
Number of outliers 9
max outlier 311.45, min outlier 283.74

 Colombia_average
Number of outliers 19
max outlier 312.95, min outlier 279.55681818181813

 Other_ny
Number of outliers 19
max outlier 303.59, min outlier 262.94

 Other_europe
Number of outliers 19
max outlier 297.22, min outlier 250.75

 Other_average
Number of outliers 19
max outlier 300.12, min outlier 255.90657034106954

 Brazil_ny
Number of outliers 25
max outlier 271.39, min outlier 201.60674603174598

 Brazil_europe
Number of outliers 23
max outlier 273.43, min outlier 216.46838264311285

 Brazil_average
Number of outliers 24
max outlier 273.4, min outlier 214.80402869732916

 Robustas_ny
Number of outliers 0
max outlier nan, min outlier nan

 Robustas_europe
Number of outliers 0
max outlier nan, min outlier nan

 Robustas_average
Number of outliers 0
max outlier nan, min outlier nan

There are no outliers in the Robustas columns, but the other columns do have outliers

In [246]:
# plotting just the average columns and the OIC_price
plt.figure(figsize=(20,8))
plt.subplot(2,3,1)
plt.hist(eda['OIC_price'], color = 'green')
plt.title('Histogram of OIC_price')

plt.subplot(2,3,2)
plt.hist(eda['Colombia_average'])
plt.title('Histogram of Colombia_average')

plt.subplot(2,3,3)
plt.hist(eda['Other_average'])
plt.title('Histogram of Other_average')

plt.subplot(2,3,4)
plt.hist(eda['Brazil_average'])
plt.title('Histogram of Brazil_average')

plt.subplot(2,3,5)
plt.hist(eda['Robustas_average'])
plt.title('Histogram of Robustas_average');

#plt.savefig('hist averages.svg')
In [247]:
fig, axes = plt.subplots(5,3, figsize = (20,22))
sns.histplot(eda_subset['Colombia_ny'], kde = True, ax = axes[0,0], color = 'purple')
sns.histplot(eda_subset['Colombia_europe'], kde = True, ax = axes[0,1], color = 'purple')
sns.histplot(eda_subset['Colombia_average'], kde = True, ax = axes[0,2], color = 'purple')

sns.histplot(eda_subset['Other_ny'], kde = True, ax = axes[1,0], color = 'red')
sns.histplot(eda_subset['Other_europe'], kde = True, ax = axes[1,1], color = 'red')
sns.histplot(eda_subset['Other_average'], kde = True, ax = axes[1,2], color = 'red')

sns.histplot(eda_subset['Brazil_ny'], kde = True, ax = axes[2,0], color = 'blue')
sns.histplot(eda_subset['Brazil_europe'], kde = True, ax = axes[2,1], color = 'blue')
sns.histplot(eda_subset['Brazil_average'], kde = True, ax = axes[2,2], color = 'blue')

sns.histplot(eda_subset['Robustas_ny'], kde = True, ax = axes[3,0], color = 'olive')
sns.histplot(eda_subset['Robustas_europe'], kde = True, ax = axes[3,1], color = 'olive')
sns.histplot(eda_subset['Robustas_average'], kde = True, ax = axes[3,2], color = 'olive')

sns.histplot(eda_subset['OIC_price'], kde = True, ax = axes[4, 0], color = 'black')
plt.show()

All columns follow roughly the same distribution pattern; however, Robustas_average is quite different from the others, and its data is not as “normally distributed”. Note that the Robustas coffee follows a distribution that looks bimodal, with two local maxima clearly visible in the graph.

The variable OIC_price, here in black, follows almost the same distribution as all the other columns, with a slight peak at around 55 US cents/lb.
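The “two local maxima” observation can be made concrete by looking for prominent peaks in the histogram counts; a minimal sketch with made-up counts standing in for the Robustas histogram:

```python
import numpy as np
from scipy.signal import find_peaks

# Made-up histogram counts with two humps, standing in for the
# Robustas histogram; indices are bin positions, not prices
counts = np.array([0, 2, 8, 20, 8, 2, 0, 0, 1, 7, 22, 7, 1, 0])

# Two sufficiently prominent local maxima suggest a bimodal shape;
# the prominence threshold filters out small noise wiggles
peaks, _ = find_peaks(counts, prominence=5)
print(len(peaks))
```

Running the same check on `np.histogram(eda['Robustas_average'])` counts would give a quick numeric confirmation of what the KDE plot shows.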

In [248]:
# grouping by the year and checking the average price every year
eda.groupby(eda['Date'].dt.year).mean()
#df_styled = eda.groupby(eda['Date'].dt.year).mean().style.background_gradient()
#dfi.export(df_styled, "avg per year.svg")
Out[248]:
OIC_price Colombia_ny Colombia_europe Colombia_average Other_ny Other_europe Other_average Brazil_ny Brazil_europe Brazil_average Robustas_ny Robustas_europe Robustas_average
Date
2000 64.245000 102.600833 99.801667 102.603333 85.094167 92.889167 87.075000 79.861667 83.666667 79.862500 42.118333 40.358333 41.413333
2001 45.591667 72.205000 68.235833 72.050833 61.937500 63.139167 62.283333 50.523333 52.415000 50.699167 27.300833 27.485833 27.540833
2002 47.755000 65.265000 64.780833 64.901667 60.439167 62.345833 61.554167 45.103333 45.920000 45.234167 30.835000 29.776667 30.027500
2003 51.896667 67.305000 64.336667 65.329167 64.088333 64.301667 64.199167 50.820000 50.158333 50.314167 38.391667 36.495833 36.944167
2004 62.150833 83.850833 79.492500 81.438333 80.145000 80.643333 80.466667 68.175833 69.110000 68.968333 37.279167 35.657500 35.978333
2005 89.344167 117.000000 114.670833 115.730833 114.292500 114.826667 114.830833 101.329167 102.485000 102.290000 53.375833 49.859167 50.505000
2006 95.745000 117.915000 115.704167 116.804167 113.950833 114.795833 114.400833 102.878333 104.186667 103.920833 70.282500 66.982500 67.558333
2007 107.679167 126.740833 124.703333 125.570000 123.163333 123.808333 127.832500 110.689167 112.064167 111.791667 88.263333 86.295000 86.371667
2008 125.465787 146.076027 144.266040 145.371284 139.630237 141.985913 141.213821 124.471924 129.475855 128.350946 107.656345 106.318875 106.561649
2009 115.670166 180.873853 174.579426 177.432834 141.651272 145.480386 143.841181 111.393568 116.553699 115.325063 77.159199 74.020763 74.578187
2010 148.160000 224.622500 227.076667 226.333333 195.440833 197.624167 196.968333 146.682500 156.916667 154.663333 85.074167 78.455833 79.552500
2011 210.389167 283.817500 283.668333 283.836667 273.201667 269.545000 271.073333 243.667500 248.723333 247.615000 115.992500 107.911667 109.206667
2012 156.358660 203.947778 200.525814 202.145996 187.591598 185.759147 186.534793 171.367051 176.131668 175.028858 110.578368 101.296184 102.763990
2013 119.511515 148.253571 147.531932 147.868106 141.080238 138.417803 139.525152 117.948095 123.557917 122.233409 100.495635 92.953447 94.160833
2014 155.259167 198.090833 198.164167 197.949167 202.849167 199.078333 200.391667 161.298333 175.288333 171.592500 105.600833 99.469167 100.433333
2015 124.672245 149.875938 154.017292 151.797906 160.532485 159.540010 159.939615 123.107688 135.718145 132.450318 94.199219 86.844965 88.048500
2016 127.381287 155.582169 155.369568 155.378077 164.632859 163.487515 163.876739 124.183534 142.724285 137.864849 94.276935 87.470577 88.629187
2017 126.678882 154.069886 150.412525 152.374415 152.410369 149.504372 150.726456 126.551287 133.781123 131.907557 104.091137 100.278229 100.950453
2018 109.035612 139.591534 133.260508 136.698098 137.401621 129.104963 132.727196 109.619033 115.099169 113.648542 88.343472 84.029311 84.795899
2019 100.517620 137.065562 129.187535 133.599465 137.459949 125.524315 130.657423 100.073484 101.988932 101.527120 80.055400 72.153073 73.562791
2020 107.937806 166.277684 146.097829 157.665431 156.809259 146.766945 150.723610 102.455498 107.856665 106.423287 78.201469 66.676662 68.753262
2021 151.288532 228.421976 206.527908 219.045860 209.884778 201.421142 204.624669 158.225545 163.010229 161.701507 99.806236 87.735133 89.874791
2022 199.429155 309.178232 272.306706 293.692236 277.856212 260.354944 266.655479 228.774979 225.648634 226.492861 114.733619 103.729691 105.490708

Note that “the daily price increased by over 50% between 30 January and 10 March, as the ongoing drought in Brazil and uncertainty over the 2014/15 crop put upward pressure on prices” (ICO.ORG, 2014). This is particularly important because January 2014 was one of the hottest months in “Brazil, which produces nearly 40% of the world’s coffee” (Wile, 2014), and many of the crops were lost that year.

In [249]:
# 2011 has the highest yearly average 'OIC_price', even more than a decade later!
avg_year = eda.groupby(eda['Date'].dt.year)['OIC_price'].mean()
avg_year.agg(['min','max'])
Out[249]:
min     45.591667
max    210.389167
Name: OIC_price, dtype: float64
In [250]:
# plotting the average per year
avg_year.plot( kind = 'bar', figsize = (20,7));
#plt.savefig('avg_year.svg')

Plotting the average price per year as a bar chart shows that the maximum yearly average OIC price was 210.39 in 2011, while the minimum yearly average OIC price was 45.59 in 2001.

In [251]:
# February and March are when coffee has the highest prices; not a significant increase, only about 3%
avg_month = eda.groupby(eda['Date'].dt.month_name())['OIC_price'].mean()
avg_month.agg(['min','max'])
Out[251]:
min    111.489454
max    115.209163
Name: OIC_price, dtype: float64
In [252]:
# plotting the average for each month
avg_month.plot( kind = 'bar', figsize = (15,4));

February and March are the months when coffee prices are highest; however, the increase is not significant, only about 3%.

Minimum monthly average price: 111.49

Maximum monthly average price: 115.21
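The “about 3%” figure follows directly from the min/max monthly averages reported in the output above:

```python
# Min and max monthly averages of OIC_price, from the output above
low, high = 111.489454, 115.209163

# Relative spread between the cheapest and the most expensive month
spread = (high - low) / low * 100
print(round(spread, 1))  # 3.3
```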

Start of Machine Learning Regression Algorithms with train_test_split()¶

In [253]:
# creates a copy of the original dataframe
test = df.copy()

# saves the column 'Date' separately and removes it from the new dataframe 'test'
date = test['Date']
test = test.drop(['Date'], axis = 1)

test = test.astype(float) # converts everything to float

Split the data frame into features (X) and target (y)¶

In [254]:
import statsmodels.regression.linear_model as sm
import statsmodels.api as am

X = test.iloc[:, 1:] # All the variables but OIC_price
y = test.iloc[:, 0] # OIC_price
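The section heading mentions `train_test_split()`; for a time series like this one, a chronological split (`shuffle=False`) is usually preferred so the test set is the most recent slice. A minimal sketch with a hypothetical toy frame mirroring the X/y layout above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame mirroring the target/feature layout above
frame = pd.DataFrame(np.arange(20.0).reshape(10, 2),
                     columns=["OIC_price", "Colombia_ny"])
X = frame.iloc[:, 1:]  # features
y = frame.iloc[:, 0]   # target

# shuffle=False keeps the chronological order, so the test set
# is the most recent 20% of the observations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
print(len(X_train), len(X_test))
```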

Get the OLS (Ordinary Least squares) and Obtain the Model¶

As there are 14 columns in the data set, in order to build a proper regression model a significance level of SL = 0.05 will be used, obtaining the p-values for each of the columns. Note that the dependent variable is OIC_price, so it is not analysed as a predictor, and the same applies to the Date column.

In order to obtain an equation that satisfies the multiple linear regression model, the p-values are computed using backward stepwise regression: any column whose p-value is > SL is discarded.

In [255]:
X = am.add_constant(X) # adds a constant to the model and evaluates all the columns

result = am.OLS(y, X).fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              OIC_price   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 3.291e+04
Date:                Sat, 03 Dec 2022   Prob (F-statistic):               0.00
Time:                        20:31:42   Log-Likelihood:                -415.90
No. Observations:                 272   AIC:                             857.8
Df Residuals:                     259   BIC:                             904.7
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                0.3823      0.277      1.378      0.169      -0.164       0.929
Colombia_ny          0.3603      0.049      7.366      0.000       0.264       0.457
Colombia_europe      0.4448      0.042     10.672      0.000       0.363       0.527
Colombia_average    -0.6751      0.088     -7.703      0.000      -0.848      -0.502
Other_ny             0.0438      0.021      2.090      0.038       0.003       0.085
Other_europe         0.1799      0.030      6.090      0.000       0.122       0.238
Other_average        0.0339      0.022      1.526      0.128      -0.010       0.078
Brazil_ny           -0.1526      0.037     -4.093      0.000      -0.226      -0.079
Brazil_europe       -0.7977      0.077    -10.403      0.000      -0.949      -0.647
Brazil_average       1.2232      0.107     11.406      0.000       1.012       1.434
Robustas_ny          0.3961      0.060      6.596      0.000       0.278       0.514
Robustas_europe      2.0278      0.255      7.959      0.000       1.526       2.530
Robustas_average    -2.0975      0.308     -6.821      0.000      -2.703      -1.492
==============================================================================
Omnibus:                       93.960   Durbin-Watson:                   0.725
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              801.467
Skew:                          -1.124   Prob(JB):                    9.20e-175
Kurtosis:                      11.103   Cond. No.                     2.77e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.77e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

There are two terms with a high p-value, but only one will be removed in this step: the constant term, which has a p-value of 0.169

In [256]:
# after removing Constant and with all the columns
X = test.iloc[:, [1,2,3,4,5,6,7,8,9,10,11,12]]

result = am.OLS(y, X).fit()
print(result.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              OIC_price   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          2.556e+05
Date:                Sat, 03 Dec 2022   Prob (F-statistic):                        0.00
Time:                        20:31:42   Log-Likelihood:                         -416.89
No. Observations:                 272   AIC:                                      857.8
Df Residuals:                     260   BIC:                                      901.1
Df Model:                          12                                                  
Covariance Type:            nonrobust                                                  
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Colombia_ny          0.3665      0.049      7.512      0.000       0.270       0.463
Colombia_europe      0.4508      0.042     10.856      0.000       0.369       0.533
Colombia_average    -0.6899      0.087     -7.919      0.000      -0.861      -0.518
Other_ny             0.0366      0.020      1.802      0.073      -0.003       0.077
Other_europe         0.1953      0.027      7.126      0.000       0.141       0.249
Other_average        0.0349      0.022      1.568      0.118      -0.009       0.079
Brazil_ny           -0.1293      0.033     -3.885      0.000      -0.195      -0.064
Brazil_europe       -0.7743      0.075    -10.337      0.000      -0.922      -0.627
Brazil_average       1.1698      0.100     11.680      0.000       0.973       1.367
Robustas_ny          0.3887      0.060      6.487      0.000       0.271       0.507
Robustas_europe      1.9771      0.253      7.828      0.000       1.480       2.474
Robustas_average    -2.0351      0.305     -6.679      0.000      -2.635      -1.435
==============================================================================
Omnibus:                       80.667   Durbin-Watson:                   0.711
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              638.622
Skew:                          -0.940   Prob(JB):                    2.11e-139
Kurtosis:                      10.267   Cond. No.                     2.72e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 2.72e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

After removing the constant, there are still columns with p-values above the threshold, so once again I remove the one with the highest p-value: column number 6, Other_average, with a p-value of 0.118.

In [257]:
# after removing #6 Other_average, and keeping the rest of the columns
X = test.iloc[:, [1,2,3,4,5,7,8,9,10,11,12]] 

result = am.OLS(y, X).fit()
print(result.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              OIC_price   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          2.773e+05
Date:                Sat, 03 Dec 2022   Prob (F-statistic):                        0.00
Time:                        20:31:42   Log-Likelihood:                         -418.17
No. Observations:                 272   AIC:                                      858.3
Df Residuals:                     261   BIC:                                      898.0
Df Model:                          11                                                  
Covariance Type:            nonrobust                                                  
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Colombia_ny          0.3649      0.049      7.462      0.000       0.269       0.461
Colombia_europe      0.4502      0.042     10.813      0.000       0.368       0.532
Colombia_average    -0.6882      0.087     -7.879      0.000      -0.860      -0.516
Other_ny             0.0506      0.018      2.764      0.006       0.015       0.087
Other_europe         0.2169      0.024      9.141      0.000       0.170       0.264
Brazil_ny           -0.1331      0.033     -3.999      0.000      -0.199      -0.068
Brazil_europe       -0.7909      0.074    -10.635      0.000      -0.937      -0.644
Brazil_average       1.1895      0.100     11.940      0.000       0.993       1.386
Robustas_ny          0.3915      0.060      6.519      0.000       0.273       0.510
Robustas_europe      1.9955      0.253      7.888      0.000       1.497       2.494
Robustas_average    -2.0558      0.305     -6.734      0.000      -2.657      -1.455
==============================================================================
Omnibus:                       82.149   Durbin-Watson:                   0.703
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              675.352
Skew:                          -0.950   Prob(JB):                    2.23e-147
Kurtosis:                      10.482   Cond. No.                     2.57e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 2.57e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Now it is the turn of the column Other_ny: before Other_average was dropped, its p-value (0.073) was above my limit of 0.05.

In [258]:
# after removing #6 Other_average and #4 Other_ny, and keeping the rest of the columns
X = test.iloc[:, [1,2,3,5,7,8,9,10,11,12]] 

result = am.OLS(y, X).fit()
print(result.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              OIC_price   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          2.975e+05
Date:                Sat, 03 Dec 2022   Prob (F-statistic):                        0.00
Time:                        20:31:42   Log-Likelihood:                         -422.09
No. Observations:                 272   AIC:                                      864.2
Df Residuals:                     262   BIC:                                      900.2
Df Model:                          10                                                  
Covariance Type:            nonrobust                                                  
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Colombia_ny          0.3786      0.049      7.685      0.000       0.282       0.476
Colombia_europe      0.4313      0.042     10.371      0.000       0.349       0.513
Colombia_average    -0.6876      0.088     -7.773      0.000      -0.862      -0.513
Other_europe         0.2700      0.014     19.123      0.000       0.242       0.298
Brazil_ny           -0.1373      0.034     -4.079      0.000      -0.204      -0.071
Brazil_europe       -0.8339      0.074    -11.325      0.000      -0.979      -0.689
Brazil_average       1.2386      0.099     12.477      0.000       1.043       1.434
Robustas_ny          0.4489      0.057      7.867      0.000       0.337       0.561
Robustas_europe      2.0976      0.253      8.277      0.000       1.599       2.597
Robustas_average    -2.2147      0.304     -7.295      0.000      -2.812      -1.617
==============================================================================
Omnibus:                       94.306   Durbin-Watson:                   0.746
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              855.094
Skew:                          -1.108   Prob(JB):                    2.08e-186
Kurtosis:                      11.399   Cond. No.                     2.37e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 2.37e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The model is complete, keeping all the columns with the exception of Other_average and Other_ny.

Model:¶

Price_OIC = (0.3786 * Colombia_ny) + (0.4313 * Colombia_europe) - (0.6876 * Colombia_average) + (0.2700 * Other_europe) - (0.1373 * Brazil_ny) - (0.8339 * Brazil_europe) + (1.2386 * Brazil_average) + (0.4489 * Robustas_ny) + (2.0976 * Robustas_europe) - (2.2147 * Robustas_average)
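The fitted model is just a weighted sum of the group prices. A quick sketch that evaluates the coefficients of the final fit above on a hypothetical row where every indicator equals 100 cents per pound (the input values are illustrative, not observed data):

```python
# coefficients from the final OLS fit (no constant), one per remaining column
coef = {
    'Colombia_ny':       0.3786,
    'Colombia_europe':   0.4313,
    'Colombia_average': -0.6876,
    'Other_europe':      0.2700,
    'Brazil_ny':        -0.1373,
    'Brazil_europe':    -0.8339,
    'Brazil_average':    1.2386,
    'Robustas_ny':       0.4489,
    'Robustas_europe':   2.0976,
    'Robustas_average': -2.2147,
}

# hypothetical input row (US cents per pound); illustrative values only
row = {name: 100.0 for name in coef}

# the prediction is the dot product of coefficients and inputs
oic_price = sum(coef[name] * row[name] for name in coef)
print(round(oic_price, 2))  # close to 100 when every input is 100
```

The prediction lands near 100, a sanity check that the coefficients roughly sum to one.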

Start of Machine Learning Regression Algorithms with train_test_split()¶

Regressions not Normalized¶

Methodology¶

In this section the algorithms are run without normalization; further ahead they will be re-run on normalized data

methodology1.jpg

Create X train, X test, Y train and Y test¶

In [259]:
from sklearn.model_selection import train_test_split

# create X_train,X_test,y_train,y_test FROM the variables X and y, using 30% of the dataset for testing.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=10)

print(f'X_train: {len(X_train)} elements  |  y_train: {len(y_train)} elements')
print(f'X_test: {len(X_test)} elements    |  y_test: {len(y_test)} elements')
X_train: 190 elements  |  y_train: 190 elements
X_test: 82 elements    |  y_test: 82 elements

The sample has been divided into 30% testing and 70% training

Linear Regression¶

After creating the training and testing sets with train_test_split() (70% for training, 30% for testing), I fit a linear regression with LinearRegression().

In [260]:
# import the necessary libraries
from sklearn.linear_model import LinearRegression

# create the linear regression model
model_regression = LinearRegression()

# Start measuring the time it takes to train the model
start = time.time()

# train the model
model_regression.fit(X_train, y_train)

# finish measuring the time it takes to train the model
end = time.time()

#obtain a prediction
predicted_regression = model_regression.predict(X_test)

# total training time
lr_time = end - start
print(f'Training time: {lr_time}s')
Training time: 0.006580829620361328s
In [261]:
# checking the metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# compute the metrics and save them
mae = mean_absolute_error(y_test, predicted_regression)
mse = mean_squared_error(y_test, predicted_regression)
rmse = sqrt(mse)
r2 = r2_score(y_test, predicted_regression)

# creates an empty dataframe to store the results for the metrics 
errors = pd.DataFrame()
errors['Metrics'] =['MAE', 'MSE', 'RMSE', 'r2']
errors['Linear_regression'] = [mae, mse, rmse, r2]
errors.set_index('Metrics', inplace = True) # sets the index
errors
Out[261]:
Linear_regression
Metrics
MAE 0.796271
MSE 1.539164
RMSE 1.240631
r2 0.999157
In [262]:
# creates an empty dataframe to store the results
results = pd.DataFrame()
results['actual'] = y_test
results['predicted_regression'] = predicted_regression
results.head(5)
Out[262]:
actual predicted_regression
180 148.240000 148.860768
24 43.460000 40.463793
183 129.020000 130.173654
214 117.262273 117.481253
56 61.470000 61.697512
In [263]:
coeff = model_regression.coef_.tolist() # convert the array of coefficients to a list
coefficients = pd.DataFrame() # create an empty dataframe
coefficients['Variable'] = X.columns.tolist() # column names become rows
coefficients['Coefficient'] = coeff # add the fitted coefficients
coefficients.set_index('Variable') # set the index for display
Out[263]:
Coefficient
Variable
Colombia_ny 0.387995
Colombia_europe 0.458673
Colombia_average -0.722548
Other_europe 0.266651
Brazil_ny -0.178374
Brazil_europe -0.981275
Brazil_average 1.428200
Robustas_ny 0.365205
Robustas_europe 1.621685
Robustas_average -1.654103
In [264]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draw a scatter plot of predicted vs actual values
ax.scatter(results.predicted_regression, results.actual, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual.min(), results.actual.max()], [results.actual.min(), results.actual.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted Regression')

# set label in Y
ax.set_ylabel('Actual')

# to show the plot
plt.show()

The metrics and the Actual vs Predicted plot show that the model fits well.

In [265]:
# Plotting Actual price vs predicted price
results.plot(kind='bar',figsize=(25,15))
plt.show()
In [266]:
#df_pred = pd.DataFrame({'actuals': y_test,'predictions': predicted_regression })

Decision Tree Regressor¶

Another algorithm studied is the Decision Tree Regressor; its metrics below are compared with the previous algorithm

In [267]:
# import the necessary libraries
from sklearn.tree import DecisionTreeRegressor

# create the model
model_tree = DecisionTreeRegressor(random_state=10)

# Start measuring the time it takes to train the model
start = time.time()

# train the model
model_tree.fit(X_train, y_train)

# finish measuring the time it takes to train the model
end = time.time()

# total training time
dt_time = end - start
print(f'Training time: {dt_time}s')

# obtain the prediction
predicted_tree = model_tree.predict(X_test)
Training time: 0.009581565856933594s
In [268]:
# compute the metrics and save them
mae = mean_absolute_error(y_test, predicted_tree)
mse = mean_squared_error(y_test, predicted_tree)
rmse = sqrt(mse)
r2 = r2_score(y_test, predicted_tree)

# dataframe to store the results for the metrics 
errors['Regression_Tree'] = [mae, mse, rmse, r2]
errors
Out[268]:
Linear_regression Regression_Tree
Metrics
MAE 0.796271 3.144109
MSE 1.539164 22.463621
RMSE 1.240631 4.739580
r2 0.999157 0.987702
In [269]:
# adds the results to the recently created dataframe 
results['predicted_tree'] = predicted_tree
results.head(5)
Out[269]:
actual predicted_regression predicted_tree
180 148.240000 148.860768 151.28
24 43.460000 40.463793 44.30
183 129.020000 130.173654 131.51
214 117.262273 117.481253 120.01
56 61.470000 61.697512 61.10
In [270]:
# # plots the fitted decision tree
# from sklearn.tree import plot_tree
# plt.figure(figsize=(20,8))#, dpi=100)
# plot_tree(model_tree, feature_names=X.columns, fontsize = 12);
In [271]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draw a scatter plot of predicted vs actual values
ax.scatter(results.predicted_tree, results.actual, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual.min(), results.actual.max()], [results.actual.min(), results.actual.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted Tree')

# set label in Y
ax.set_ylabel('Actual')

# to show the plot
plt.show()

The metrics are worse than the linear regression's; the RMSE in particular is much higher.

The plot confirms this: several points fall noticeably away from the identity line.
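The gap between the tree's MAE and RMSE reflects a few large misses, since squaring amplifies big errors. A tiny worked example of how the three error metrics respond to a single outlier (the numbers are illustrative, not the notebook's data):

```python
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual    = [100, 100, 100, 100]
predicted = [101,  99, 100,  90]   # three small misses and one large miss of 10

mae  = mean_absolute_error(actual, predicted)   # (1 + 1 + 0 + 10) / 4 = 3.0
mse  = mean_squared_error(actual, predicted)    # (1 + 1 + 0 + 100) / 4 = 25.5
rmse = sqrt(mse)                                # about 5.05

print(mae, mse, rmse)
```

One error of 10 dominates the MSE, so the RMSE (5.05) ends up well above the MAE (3.0); that is exactly the pattern in the tree's metrics above.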

LASSO Regressor (Least Absolute Shrinkage and Selection Operator)¶

Another algorithm used to evaluate the data is LASSO; its metrics are compared with the previous models below
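Lasso adds an L1 penalty (alpha times the sum of absolute coefficient values) to the least-squares objective, so larger alpha values shrink coefficients toward zero and can drop weak predictors entirely. A small sketch on synthetic data (the features and alpha values are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import Lasso

# illustrative synthetic data: only the first feature matters strongly
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=200)

for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))  # coefficients shrink as alpha grows
```

At the largest alpha, the weak and irrelevant features are typically driven to exactly zero, which is what distinguishes Lasso from plain ridge shrinkage.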

In [272]:
# import the necessary libraries
from sklearn.linear_model import Lasso

# creates the model
#lasso_regressor = Lasso(fit_intercept=False)
alpha = 0.1 # penalty strength used later for the normalized data
beta = 1 # penalty strength used here for the non-normalized data
lasso_regressor = Lasso(beta)

# Start measuring the time it takes to train the model
start = time.time()

# train the model
lasso_regressor.fit(X_train, y_train)

# finish measuring the time it takes to train the model
end = time.time()

# total training time
lasso_time = end - start
print(f'Training time: {lasso_time}s')

# obtain the predictions
predicted_lasso = lasso_regressor.predict(X_test)
Training time: 0.008007526397705078s
In [273]:
mae = mean_absolute_error(y_test, predicted_lasso)
mse = mean_squared_error(y_test, predicted_lasso)
rmse = sqrt(mse)
r2 = r2_score(y_test, predicted_lasso)

# dataframe to store the results for the metrics 
errors['Regression_Lasso'] = [mae, mse, rmse, r2]
errors
Out[273]:
Linear_regression Regression_Tree Regression_Lasso
Metrics
MAE 0.796271 3.144109 1.208602
MSE 1.539164 22.463621 3.646354
RMSE 1.240631 4.739580 1.909543
r2 0.999157 0.987702 0.998004
In [274]:
results['predicted_lasso'] = predicted_lasso
results.head(5)
Out[274]:
actual predicted_regression predicted_tree predicted_lasso
180 148.240000 148.860768 151.28 148.874913
24 43.460000 40.463793 44.30 40.909175
183 129.020000 130.173654 131.51 130.039407
214 117.262273 117.481253 120.01 117.874111
56 61.470000 61.697512 61.10 59.593575
In [275]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draws a scater plot wit results_predict and results
ax.scatter(results.predicted_lasso, results.actual, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual.min(), results.actual.max()], [results.actual.min(), results.actual.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted LASSO')

# set label in Y
ax.set_ylabel('Actual')

# to show the plot
plt.show()

The plot shows a better fit than the decision tree, and the metrics above confirm it: LASSO's RMSE is lower than the tree's, but still worse than the linear regression's.

SVM Support Vector Machine Regressor¶

This is the last of the algorithms being evaluated. The metrics show a very large MSE and, consequently, a very poor RMSE.

In [276]:
# import the necessary libraries
from sklearn.svm import SVR

# creates the model
svm_regressor = SVR(kernel = 'rbf') # radial base function or gaussian

# Start measuring the time it takes to train the model
start = time.time()

# train the model
svm_regressor.fit(X_train, y_train)

# finish measuring the time it takes to train the model
end = time.time()

# total training time
svm_time = end - start
print(f'Training time: {svm_time}s')

# obtain the predictions
svm_predicted = svm_regressor.predict(X_test)
Training time: 0.012293100357055664s
In [277]:
mae = mean_absolute_error(y_test, svm_predicted)
mse = mean_squared_error(y_test, svm_predicted)
rmse = sqrt(mse)
r2 = r2_score(y_test, svm_predicted)

# dataframe to store the results for the metrics 
errors['SVM_Regression'] = [mae, mse, rmse, r2]
errors
Out[277]:
Linear_regression Regression_Tree Regression_Lasso SVM_Regression
Metrics
MAE 0.796271 3.144109 1.208602 12.520108
MSE 1.539164 22.463621 3.646354 456.083021
RMSE 1.240631 4.739580 1.909543 21.356100
r2 0.999157 0.987702 0.998004 0.750310
In [278]:
results['predicted_svm'] = svm_predicted
results.head(5)
Out[278]:
actual predicted_regression predicted_tree predicted_lasso predicted_svm
180 148.240000 148.860768 151.28 148.874913 141.317707
24 43.460000 40.463793 44.30 40.909175 74.529230
183 129.020000 130.173654 131.51 130.039407 128.234328
214 117.262273 117.481253 120.01 117.874111 118.445614
56 61.470000 61.697512 61.10 59.593575 73.338619
In [279]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draw a scatter plot of predicted vs actual values
ax.scatter(results.predicted_svm, results.actual, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual.min(), results.actual.max()], [results.actual.min(), results.actual.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted SVM')

# set label in Y
ax.set_ylabel('Actual')

# to show the plot
plt.show()

The graph also confirms that the model is not fitting well. SVMs are sensitive to feature scale, so the inputs should be normalized before use; this will be done further ahead.
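Rather than scaling by hand, scikit-learn's Pipeline can bundle the scaler and the SVR so the same transformation is applied at both fit and predict time. A minimal sketch on synthetic data (the features and coefficients are illustrative, not the notebook's variables):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# illustrative synthetic data with features on a large scale
rng = np.random.default_rng(0)
X = rng.uniform(0, 500, size=(300, 2))
y = 0.002 * X[:, 0] + 0.004 * X[:, 1] + rng.normal(scale=0.05, size=300)

# the pipeline standardizes the features inside fit() and predict() automatically
model = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
model.fit(X, y)
print(round(model.score(X, y), 3))  # r^2 on the training data
```

This avoids a common pitfall: fitting the scaler on the full dataset and leaking test-set statistics into training.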

Regressions Normalized¶

To check for Normality please open this other file

Methodology¶

In this section, the algorithms are run on data normalized with StandardScaler()

methodology2.jpg

Split the data frame into features (X) and target (y)¶

In [280]:
# creates X and y, with All the columns for X and for y = OIC_price
X = test.iloc[:, [1,2,3,5,7,8,9,10,11,12]].values  # All the variables but OIC_price, values: to get them as numpy array
y = test.iloc[:, 0].values # OIC_price, values to get them as numpy array

Normalize with StandarScaler()¶

In [281]:
# StandardScaler standardizes each column to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

# one scaler for the features and one for the target
X_scaler = StandardScaler()
y_scaler = StandardScaler()
    
# fits and transforms the data
X = X_scaler.fit_transform(X)
y = y.reshape(-1,1) # reshape to a 2-D column array, as the scaler requires
y = y_scaler.fit_transform(y)   

# converting the numpy array back into dataframe
X = pd.DataFrame(X, columns = ['Colombia_ny', 'Colombia_europe', 'Colombia_average', 'Other_europe',
                               'Brazil_ny', 'Brazil_europe', 'Brazil_average', 'Robustas_ny', 'Robustas_europe',
                               'Robustas_average'])

y = np.squeeze(y) # collapse the array back from 2-D to 1-D
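Because y is standardized here, later predictions and error metrics are on the z-score scale; `StandardScaler.inverse_transform` maps them back to the original price units. A sketch using a few of the prices seen in the results above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# a few of the OIC prices from the results table (US cents per pound), 2-D as the scaler expects
prices = np.array([[43.46], [61.47], [117.26], [129.02], [148.24]])

scaler = StandardScaler()
z = scaler.fit_transform(prices)        # standardized values (mean 0, unit variance)
back = scaler.inverse_transform(z)      # mapped back to the original price scale

print(np.allclose(back, prices))  # → True
```

An RMSE computed on the z scale converts back to cents per pound by multiplying by `scaler.scale_[0]`, the standard deviation the scaler learned.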

Create X train, X test, Y train and Y test¶

In [282]:
from sklearn.model_selection import train_test_split

# create X_train,X_test,y_train,y_test FROM the variables X and y, using 30% of the dataset for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=10)

print(f'X_train: {len(X_train)} elements  |  y_train: {len(y_train)} elements')
print(f'X_test: {len(X_test)} elements    |  y_test: {len(y_test)} elements')
X_train: 190 elements  |  y_train: 190 elements
X_test: 82 elements    |  y_test: 82 elements

Linear Regression Normalized¶

Now let’s apply the same algorithms, this time on the normalized data.

The data is again split 30% for testing and 70% for training; the resulting metrics are compared with the previous algorithms.

In [283]:
# import the necessary libraries
from sklearn.linear_model import LinearRegression

# create the linear regression model
model_regression = LinearRegression()

# Start measuring the time it takes to train the model
start = time.time()

# train the model
model_regression.fit(X_train, y_train)

# finish measuring the time it takes to train the model
end = time.time()

# total training time
lrn_time = end - start
print(f'Training time: {lrn_time}s')

#obtain a prediction
predicted_regression_norm = model_regression.predict(X_test)
Training time: 0.0336153507232666s
In [284]:
# compute the metrics and save them
mae = mean_absolute_error(y_test, predicted_regression_norm)
mse = mean_squared_error(y_test, predicted_regression_norm)
rmse = sqrt(mse)
r2 = r2_score(y_test, predicted_regression_norm)

# dataframe to store the results for the metrics 
errors['Linear_Regression_Norm'] = [mae, mse, rmse, r2]
errors
Out[284]:
Linear_regression Regression_Tree Regression_Lasso SVM_Regression Linear_Regression_Norm
Metrics
MAE 0.796271 3.144109 1.208602 12.520108 0.018260
MSE 1.539164 22.463621 3.646354 456.083021 0.000809
RMSE 1.240631 4.739580 1.909543 21.356100 0.028450
r2 0.999157 0.987702 0.998004 0.750310 0.999157
In [285]:
results['actual_norm'] = y_test
results['predicted_regression_norm'] = predicted_regression_norm
results.head()
Out[285]:
actual predicted_regression predicted_tree predicted_lasso predicted_svm actual_norm predicted_regression_norm
180 148.240000 148.860768 151.28 148.874913 141.317707 0.793596 0.807831
24 43.460000 40.463793 44.30 40.909175 74.529230 -1.609199 -1.677907
183 129.020000 130.173654 131.51 130.039407 128.234328 0.352846 0.379302
214 117.262273 117.481253 120.01 117.874111 118.445614 0.083220 0.088242
56 61.470000 61.697512 61.10 59.593575 73.338619 -1.196197 -1.190980
In [286]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draw a scatter plot of predicted vs actual values
ax.scatter(results.predicted_regression_norm, results.actual_norm, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual_norm.min(), results.actual_norm.max()], [results.actual_norm.min(), results.actual_norm.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted Regression')

# set label in Y
ax.set_ylabel('Actual Normalized')

# to show the plot
plt.show()

Note that the error metrics are much smaller, partly because the target is now on a standardized scale; the scale-free r2 matches the non-normalized linear regression (0.999157). The plot confirms the good fit
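A caveat when comparing these columns with the earlier ones: MAE, MSE, and RMSE shrink when the target is standardized simply because the scale shrinks, while r2 is unaffected by a common rescaling. A small sketch with illustrative numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# illustrative prices and near-perfect predictions
rng = np.random.default_rng(0)
actual = rng.uniform(40, 150, size=50)
predicted = actual + rng.normal(scale=1.0, size=50)

# standardize both with the same mean and std, as the notebook's y_scaler does
mu, sd = actual.mean(), actual.std()
actual_z, predicted_z = (actual - mu) / sd, (predicted - mu) / sd

print(np.isclose(r2_score(actual, predicted), r2_score(actual_z, predicted_z)))  # → True
print(mean_squared_error(actual_z, predicted_z) < mean_squared_error(actual, predicted))  # → True
```

So r2 is the fair yardstick across the normalized and non-normalized columns, while RMSE comparisons are only meaningful within one scale.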

In [287]:
# Plotting Actual price vs predicted price
results.plot(kind='bar',figsize=(25,15))
plt.show()

Decision Tree Regressor Normalized¶

The Decision Tree Regressor is analyzed the same way as before, now on the normalized data; here are the metrics

In [288]:
# import the necessary libraries
from sklearn.tree import DecisionTreeRegressor

# create the model
model_tree = DecisionTreeRegressor(random_state=10)

# Start measuring the time it takes to train the model
start = time.time()

# train the model
model_tree.fit(X_train, y_train)

# finish measuring the time it takes to train the model
end = time.time()

# total training time
dtn_time = end - start
print(f'Training time: {dtn_time}s')

# obtain the prediction
predicted_tree_norm = model_tree.predict(X_test)
Training time: 0.004998922348022461s
In [289]:
# compute the metrics and save them
mae = mean_absolute_error(y_test, predicted_tree_norm)
mse = mean_squared_error(y_test, predicted_tree_norm)
rmse = sqrt(mse)
r2 = r2_score(y_test, predicted_tree_norm)

# dataframe to store the results for the metrics 
errors['Regression_Tree_Norm'] = [mae, mse, rmse, r2]
errors
Out[289]:
Linear_regression Regression_Tree Regression_Lasso SVM_Regression Linear_Regression_Norm Regression_Tree_Norm
Metrics
MAE 0.796271 3.144109 1.208602 12.520108 0.018260 0.076258
MSE 1.539164 22.463621 3.646354 456.083021 0.000809 0.013131
RMSE 1.240631 4.739580 1.909543 21.356100 0.028450 0.114592
r2 0.999157 0.987702 0.998004 0.750310 0.999157 0.986329
In [290]:
# adds the results to the recently created dataframe 
results['predicted_tree_norm'] = predicted_tree_norm
results.head(5)
Out[290]:
actual predicted_regression predicted_tree predicted_lasso predicted_svm actual_norm predicted_regression_norm predicted_tree_norm
180 148.240000 148.860768 151.28 148.874913 141.317707 0.793596 0.807831 0.863308
24 43.460000 40.463793 44.30 40.909175 74.529230 -1.609199 -1.677907 -1.589936
183 129.020000 130.173654 131.51 130.039407 128.234328 0.352846 0.379302 0.409946
214 117.262273 117.481253 120.01 117.874111 118.445614 0.083220 0.088242 0.102184
56 61.470000 61.697512 61.10 59.593575 73.338619 -1.196197 -1.190980 -1.204682
In [291]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draw a scatter plot of predicted vs actual values
ax.scatter(results.predicted_tree_norm, results.actual_norm, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual_norm.min(), results.actual_norm.max()], [results.actual_norm.min(), results.actual_norm.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted Tree')

# set label in Y
ax.set_ylabel('Actual Normalized')

# to show the plot
plt.show()

It is a good predictor compared with the non-normalized version, but its RMSE is still worse than the normalized linear regression's, as the graph also shows.

LASSO Regressor (Least Absolute Shrinkage and Selection Operator) Normalized¶

Once normalized, the Lasso model produces the following metrics, considerably better than the non-normalized Lasso (the RMSE drops sharply); the graph follows

In [292]:
# import the necessary libraries
from sklearn.linear_model import Lasso

# create the model; Lasso minimizes the sum of squared residuals + alpha * |coefficients|
# with alpha = 0 it reduces to ordinary least squares, and as alpha grows
# the coefficients shrink toward zero
lasso_regressor = Lasso(alpha)

# Start measuring the time it takes to train the model
start = time.time()

# train the model
lasso_regressor.fit(X_train, y_train)

# finish measuring the time it takes to train the model
end = time.time()

# total training time
lasson_time = end - start
print(f'Training time: {lasson_time}s')

# obtain the predictions
predicted_lasso_norm = lasso_regressor.predict(X_test)
Training time: 0.009485483169555664s
In [293]:
mae = mean_absolute_error(y_test, predicted_lasso_norm)
mse = mean_squared_error(y_test, predicted_lasso_norm)
rmse = sqrt(mse)
r2 = r2_score(y_test, predicted_lasso_norm)

# dataframe to store the results for the metrics 
errors['Regression_Lasso_Norm'] = [mae, mse, rmse, r2]
errors
Out[293]:
Linear_regression Regression_Tree Regression_Lasso SVM_Regression Linear_Regression_Norm Regression_Tree_Norm Regression_Lasso_Norm
Metrics
MAE 0.796271 3.144109 1.208602 12.520108 0.018260 0.076258 0.086498
MSE 1.539164 22.463621 3.646354 456.083021 0.000809 0.013131 0.012066
RMSE 1.240631 4.739580 1.909543 21.356100 0.028450 0.114592 0.109843
r2 0.999157 0.987702 0.998004 0.750310 0.999157 0.986329 0.987439
In [294]:
results['predicted_lasso_norm'] = predicted_lasso_norm
results.head(5)
Out[294]:
actual predicted_regression predicted_tree predicted_lasso predicted_svm actual_norm predicted_regression_norm predicted_tree_norm predicted_lasso_norm
180 148.240000 148.860768 151.28 148.874913 141.317707 0.793596 0.807831 0.863308 0.740549
24 43.460000 40.463793 44.30 40.909175 74.529230 -1.609199 -1.677907 -1.589936 -1.479670
183 129.020000 130.173654 131.51 130.039407 128.234328 0.352846 0.379302 0.409946 0.334779
214 117.262273 117.481253 120.01 117.874111 118.445614 0.083220 0.088242 0.102184 0.073195
56 61.470000 61.697512 61.10 59.593575 73.338619 -1.196197 -1.190980 -1.204682 -1.083827
In [295]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draw a scatter plot of predicted vs actual values
ax.scatter(results.predicted_lasso_norm, results.actual_norm, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual_norm.min(), results.actual_norm.max()], [results.actual_norm.min(), results.actual_norm.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted Lasso Normalized')

# set label in Y
ax.set_ylabel('Actual Normalized')

# to show the plot
plt.show()

SVM Support Vector Machine Regressor Normalized¶

The last of the algorithms being evaluated shows quite a change: its RMSE went from 21.36 to 0.07 after being normalized

In [296]:
# import the necessary libraries
from sklearn.svm import SVR

# creates the model
svm_regressor = SVR(kernel = 'rbf') # radial basis function (Gaussian) kernel

# Start measuring the time it takes to train the model
start = time.time()

# train the model
svm_regressor.fit(X_train, np.ravel(y_train))

# finish measuring the time it takes to train the model
end = time.time()

# total training time
svmn_time = end - start
print(f'Training time: {svmn_time}s')

# obtain the predictions
svm_predicted_norm = svm_regressor.predict(X_test)
Training time: 0.0071866512298583984s
In [297]:
mae = mean_absolute_error(y_test, svm_predicted_norm)
mse = mean_squared_error(y_test, svm_predicted_norm)
rmse = sqrt(mse)
r2 = r2_score(y_test, svm_predicted_norm)

# dataframe to store the results for the metrics 
errors['SVM_Regression_Norm'] = [mae, mse, rmse, r2]
errors
Out[297]:
Linear_regression Regression_Tree Regression_Lasso SVM_Regression Linear_Regression_Norm Regression_Tree_Norm Regression_Lasso_Norm SVM_Regression_Norm
Metrics
MAE 0.796271 3.144109 1.208602 12.520108 0.018260 0.076258 0.086498 0.059528
MSE 1.539164 22.463621 3.646354 456.083021 0.000809 0.013131 0.012066 0.005159
RMSE 1.240631 4.739580 1.909543 21.356100 0.028450 0.114592 0.109843 0.071826
r2 0.999157 0.987702 0.998004 0.750310 0.999157 0.986329 0.987439 0.994629
In [298]:
results['svm_predicted_norm'] = svm_predicted_norm # predicted values when normalized
results.head(5)
Out[298]:
actual predicted_regression predicted_tree predicted_lasso predicted_svm actual_norm predicted_regression_norm predicted_tree_norm predicted_lasso_norm svm_predicted_norm
180 148.240000 148.860768 151.28 148.874913 141.317707 0.793596 0.807831 0.863308 0.740549 0.723952
24 43.460000 40.463793 44.30 40.909175 74.529230 -1.609199 -1.677907 -1.589936 -1.479670 -1.505413
183 129.020000 130.173654 131.51 130.039407 128.234328 0.352846 0.379302 0.409946 0.334779 0.305700
214 117.262273 117.481253 120.01 117.874111 118.445614 0.083220 0.088242 0.102184 0.073195 0.102109
56 61.470000 61.697512 61.10 59.593575 73.338619 -1.196197 -1.190980 -1.204682 -1.083827 -1.310965
In [299]:
# plotting Actual vs Predicted
# axes and size
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draws a scatter plot with the predicted and the actual values
ax.scatter(results.svm_predicted_norm, results.actual_norm, edgecolors=(0, 0, 1))

# draws the line 
ax.plot([results.actual_norm.min(), results.actual_norm.max()], [results.actual_norm.min(), results.actual_norm.max()], 'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted SVM Normalized')

# set label in Y
ax.set_ylabel('Actual Normalized')

# to show the plot
plt.show()

The graph shows how large the improvement is when compared with the non-normalized SVM

Transposing the dataframe for aesthetic purposes

In [300]:
errors_transposed = errors.T
errors_transposed
Out[300]:
Metrics MAE MSE RMSE r2
Linear_regression 0.796271 1.539164 1.240631 0.999157
Regression_Tree 3.144109 22.463621 4.739580 0.987702
Regression_Lasso 1.208602 3.646354 1.909543 0.998004
SVM_Regression 12.520108 456.083021 21.356100 0.750310
Linear_Regression_Norm 0.018260 0.000809 0.028450 0.999157
Regression_Tree_Norm 0.076258 0.013131 0.114592 0.986329
Regression_Lasso_Norm 0.086498 0.012066 0.109843 0.987439
SVM_Regression_Norm 0.059528 0.005159 0.071826 0.994629

Neural Network¶

The neural network is an exercise I did as a personal challenge and to gain knowledge of this interesting topic.

Experts recommend normalizing the data before training a neural network; since the data has already been normalized, I will reuse the previously normalized data.

In [301]:
import tensorflow as tf
from tensorflow import keras

The first step is to obtain all the columns again and create X and y:

In [302]:
# creates X and y, with All the columns for X and for y = OIC_price
X = test.iloc[:, 1:].values  # All the variables but OIC_price, values: to get them as numpy array
y = test.iloc[:, 0].values # OIC_price, values to get them as numpy array
In [303]:
# applies standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

# normalizing with the standard-deviation (z-score) method
X_scaler = StandardScaler()
y_scaler = StandardScaler()
    
# fits and transforms the data
X = X_scaler.fit_transform(X)
y = y.reshape(-1,1) # reshape because the array has to be a 2D array or (1 column, many rows..)
y = y_scaler.fit_transform(y)   
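Because y is scaled as well, any prediction made in normalized space has to be mapped back with `inverse_transform` before it can be read as a price. A minimal sketch with hypothetical values (not the actual OIC prices):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
prices = np.array([[100.0], [120.0], [140.0]])  # hypothetical prices in cents per pound

scaled = scaler.fit_transform(prices)           # z-scores: zero mean, unit variance
recovered = scaler.inverse_transform(scaled)    # back to the original price scale

assert np.allclose(recovered, prices)
```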

Then the data is saved as a NumPy file with the 2 arrays: inputs = X and OIC_price = y

In [304]:
# saves as numpy file, called neural_network, with 2 sets of data X and y
np.savez("neural_network", inputs = X, OIC_price = y)

Load the file again, or simply keep working with the in-memory arrays; in this case it was loaded again

In [305]:
# loads the file again
Neural_Network = np.load("neural_network.npz")

To see the inputs and output, uncomment the following lines

In [306]:
#print("inputs:", Neural_Network["inputs"])
#print("OIC_price:", Neural_Network["OIC_price"])

The number of input neurons is 12, one per column in the data set. The variable we want to predict, OIC_price, is the output layer: a single neuron.

In [307]:
num_rows, input_size = X.shape  # input_size = 12: the number of input neurons (one per column)
output_size = 1 # output_size = 1: the number of output neurons (OIC_price, a single neuron as output)

A dense layer is needed (connecting all 12 input neurons to one output neuron, i.e., from the 12 columns to the predicted column OIC_price).

Then I have to choose a model to hold the layers; in this case Sequential (a sequence of layers stacked one after another), which suits a simple, non-advanced neural network.

In [308]:
# dense layer (from all neurons to one neuron, in this case as an output, specify the input and output)
# I have to build a model for the layers, in this case is the Sequential 
# (is a sequence of layers stacked one after another) 
# that is for a neural network that is not advanced

models = tf.keras.Sequential([tf.keras.layers.Dense(output_size)])

I have to prepare the model to be trained (this step is called compiling). I will use the Adam optimizer, which adjusts the weights and biases efficiently so the network learns instead of unlearning (it gets better step by step).

The learning rate is 0.1, which controls the size of each adjustment to the weights and biases

In [309]:
# I have to prepare the model to be trained (this step is called compiling)
# the Adam optimizer lets the network adjust the bias and weights efficiently,
# so it learns instead of unlearning (gets better step by step)
# the learning rate is 0.1, which controls how the weights and bias are adjusted

models.compile(optimizer = tf.keras.optimizers.Adam(0.1), loss='mean_squared_error',  metrics = ['mae', 'mse'])

To train the model I choose epochs = 400, i.e., how many passes over the data to make (one epoch means checking all 272 entries). Usually more epochs mean better training, but only up to a certain point.

In [310]:
# training the model
# epochs = 400: how many passes over the data (one epoch means checking all 272 entries)
# usually the more epochs the better the training, but only up to a certain point
# verbose = False, so it doesn't print the per-epoch progress

print('Beginning training...')

# Start measuring the time it takes to train the model
start = time.time()

history  = models.fit(Neural_Network["inputs"], Neural_Network["OIC_price"], epochs=400, verbose=False)

# finish measuring the time it takes to train the model
end = time.time()

# total training time
nn_time = end - start

print('Finished training the model...')
print(f'\nTraining time: {nn_time}s')
Beginning training...
Finished training the model...

Training time: 12.392646789550781s
In [311]:
models.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_2 (Dense)             (None, 1)                 13        
                                                                 
=================================================================
Total params: 13
Trainable params: 13
Non-trainable params: 0
_________________________________________________________________

To see the loss, MAE, MSE and RMSE in a data frame. Note that the loss values are the same as the MSE values, since the loss function is the mean squared error.

In [312]:
hist = pd.DataFrame(history.history)
hist['rmse'] = np.sqrt(hist['mse'])
hist.head()
Out[312]:
loss mae mse rmse
0 0.251848 0.356622 0.251848 0.501845
1 0.077468 0.217150 0.077468 0.278330
2 0.036251 0.143413 0.036251 0.190396
3 0.012998 0.085469 0.012998 0.114008
4 0.008134 0.067624 0.008134 0.090187
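That the loss and MSE columns coincide is by construction: the model was compiled with `loss='mean_squared_error'` while also tracking `'mse'` as a metric. A minimal sketch with hypothetical history values mirroring the structure Keras returns:

```python
import numpy as np
import pandas as pd

# hypothetical training history values (not the actual run above)
hist_demo = pd.DataFrame({'loss': [0.25, 0.08, 0.04],
                          'mae':  [0.36, 0.22, 0.14],
                          'mse':  [0.25, 0.08, 0.04]})
hist_demo['rmse'] = np.sqrt(hist_demo['mse'])

# loss equals mse when the loss function itself is the mean squared error
assert np.allclose(hist_demo['loss'], hist_demo['mse'])
```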

The algorithm has trained the model; to see the weights and the bias:

In [313]:
# get the weights and bias
weights = models.layers[0].get_weights()[0]
bias = models.layers[0].get_weights()[1]
In [314]:
print(f'These are the bias variables: \n{bias}')
print(f'These are the layer variables: \n{weights}')
These are the bias variables: 
[-0.01294355]
These are the layer variables: 
[[ 0.4074403 ]
 [ 0.61317325]
 [-0.855531  ]
 [ 0.10364028]
 [ 0.13023104]
 [ 0.09133464]
 [ 0.07818925]
 [-0.29311907]
 [ 0.502885  ]
 [ 0.07967005]
 [ 0.33140746]
 [-0.19759452]]
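With these parameters, a single Dense layer is just a linear map: y = X·W + b. A minimal sketch with hypothetical weights for two inputs (not the fitted values above):

```python
import numpy as np

W = np.array([[0.5], [-0.25]])    # hypothetical weights for 2 inputs
b = np.array([0.1])               # hypothetical bias
X_toy = np.array([[2.0, 4.0]])    # one toy sample

y_toy = X_toy @ W + b             # what Dense(1) computes for one sample
assert np.isclose(y_toy[0, 0], 2.0 * 0.5 + 4.0 * (-0.25) + 0.1)
```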

I want to compute the Mean Squared Error (MSE) in order to obtain the RMSE

In [315]:
# to measure the mean squared error
from sklearn.metrics import mean_squared_error

OIC_price = Neural_Network['OIC_price'].round(1)
nn_mse = mean_squared_error(y, OIC_price) # squared=True (the default) returns the MSE
nn_rmse = mean_squared_error(y, OIC_price, squared=False) # squared=False returns the RMSE

print(f'\nrmse = {nn_rmse}') 
rmse = 0.028628030460665175
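The relationship between the two metrics is simply RMSE = √MSE; a quick check with toy values (note that in recent scikit-learn versions the `squared` flag is deprecated in favour of `root_mean_squared_error`):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # toy actual values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # toy predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                         # same value squared=False would return

assert np.isclose(mse, 0.375)
assert np.isclose(rmse, np.sqrt(0.375))
```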

A very good RMSE; let's compare it at the end with the other algorithms

The plot of the predicted values against the actual values

In [316]:
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()

# draws a scatter plot with the predicted and the actual values
ax.scatter(np.squeeze(models.predict_on_batch(Neural_Network['inputs'])),np.squeeze(Neural_Network['OIC_price']), edgecolors=(0, 0, 1))

#draws the line
ax.plot([np.squeeze(Neural_Network['OIC_price']).min(), np.squeeze(Neural_Network['OIC_price']).max()],
        [np.squeeze(Neural_Network['OIC_price']).min(), np.squeeze(Neural_Network['OIC_price']).max()],'r--', lw=2)

# set label in X
ax.set_xlabel('Predicted value')

# set label in Y
ax.set_ylabel('Actual value')

plt.show()

This measures the loss, i.e., how bad the results were at each epoch.

From here I can say that the model could have been trained with far fewer epochs

In [317]:
# to see the loss: how bad the results were at each epoch
import matplotlib.pyplot as plt
fig = plt.figure(figsize = [15,7])
ax = fig.add_subplot()
plt.xlabel('Epoch Number')
plt.ylabel("Loss Magnitude")
plt.plot(history.history['loss'], label = 'loss')
#plt.plot(history.history['mae'], label = 'mae')
plt.plot(hist['rmse'], label = 'rmse')
plt.legend()
plt.show()
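Since the loss flattens early, the training loop could be cut short automatically instead of always running 400 epochs. A sketch using Keras's `EarlyStopping` callback (the parameter values here are illustrative, not tuned):

```python
import tensorflow as tf

# stop when the training loss has not improved by at least min_delta
# for `patience` consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss',
                                              patience=10,
                                              min_delta=1e-4,
                                              restore_best_weights=True)

# then pass it to fit, e.g.:
# models.fit(inputs, targets, epochs=400, callbacks=[early_stop], verbose=False)
```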

Start of Machine Learning Regression Algorithms with Cross Validation¶

Regressions Not Normalized¶

Methodology¶

In this section, the algorithms will be run without normalization; further ahead they will be run normalized

methodology3.jpg

Split the data frame into X and y¶

In [91]:
X = test.iloc[:, [1,2,3,5,7,8,9,10,11,12]] # variables of the model
y = test.iloc[:, 0] # OIC_price

Cross Validation¶

In this exercise the number of splits k = 5, with Shuffle = True
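With 272 rows and k = 5, each validation fold gets 54 or 55 rows; a quick sketch confirming the fold sizes (assuming the 272-row data set used here):

```python
import numpy as np
from sklearn.model_selection import KFold

cv_demo = KFold(n_splits=5, random_state=10, shuffle=True)
fold_sizes = [len(test_idx) for _, test_idx in cv_demo.split(np.zeros((272, 1)))]

assert sum(fold_sizes) == 272                     # every row lands in exactly one validation fold
assert sorted(fold_sizes) == [54, 54, 54, 55, 55] # 272 / 5 rounds to folds of 55 and 54
```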

In [92]:
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold

 # number of splits
k = 5

# cross-validation method
cv = KFold(n_splits=k, random_state=10, shuffle=True)

cont = 1
# splits data into training and test set.
for train_index, test_index in cv.split(X, y):
    print(f'\nFold:{cont}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    print("\nTrain:", train_index, "\n\nValidation:",test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    cont = cont + 1
Fold:1, Train set: 217, Test set:55

Train: [  0   1   2   3   4   5   6   7   8   9  11  12  13  14  15  16  17  18
  22  23  25  27  28  29  30  31  32  33  34  36  37  39  40  41  42  44
  45  46  48  50  51  52  53  54  55  57  58  62  63  65  66  67  68  69
  70  71  72  73  74  75  76  77  78  79  81  82  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  99 100 101 102 103 104 106 107 108 109
 110 112 115 116 117 118 119 120 122 123 124 125 126 127 128 132 133 134
 135 136 137 138 139 140 141 143 144 145 146 147 148 150 151 152 153 154
 156 157 158 160 161 162 163 164 165 166 168 169 172 174 175 177 178 179
 182 184 185 186 187 188 189 190 191 195 196 197 198 199 200 201 202 203
 204 206 207 208 209 210 212 213 215 216 218 219 220 221 222 223 224 225
 226 227 228 229 231 232 234 236 237 238 239 240 241 243 244 245 246 247
 248 249 251 252 253 254 255 256 257 259 260 261 262 263 264 265 267 268
 269] 

Validation: [ 10  19  20  21  24  26  35  38  43  47  49  56  59  60  61  64  80  83
  98 105 111 113 114 121 129 130 131 142 149 155 159 167 170 171 173 176
 180 181 183 192 193 194 205 211 214 217 230 233 235 242 250 258 266 270
 271]

Fold:2, Train set: 217, Test set:55

Train: [  3   4   8  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24
  25  26  27  28  29  30  31  33  34  35  36  38  39  40  41  42  43  44
  45  47  48  49  50  51  53  54  56  57  59  60  61  62  64  65  66  67
  71  72  73  74  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 103 104 105 107 108 109 110
 111 112 113 114 115 117 118 121 122 123 124 125 126 128 129 130 131 133
 134 135 137 139 140 141 142 143 144 145 146 149 150 151 152 153 154 155
 156 157 158 159 163 165 167 168 170 171 173 174 175 176 177 179 180 181
 182 183 184 185 186 187 188 189 190 191 192 193 194 197 199 200 201 202
 203 204 205 206 207 208 209 210 211 213 214 215 216 217 218 220 221 222
 224 225 226 228 229 230 231 233 234 235 238 239 240 242 243 245 246 247
 249 250 252 255 256 257 258 259 260 261 263 264 265 266 267 268 269 270
 271] 

Validation: [  0   1   2   5   6   7   9  32  37  46  52  55  58  63  68  69  70  75
  76 102 106 116 119 120 127 132 136 138 147 148 160 161 162 164 166 169
 172 178 195 196 198 212 219 223 227 232 236 237 241 244 248 251 253 254
 262]

Fold:3, Train set: 218, Test set:54

Train: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  18
  19  20  21  23  24  26  27  28  29  30  31  32  33  34  35  36  37  38
  40  42  43  44  45  46  47  49  50  51  52  54  55  56  57  58  59  60
  61  62  63  64  65  68  69  70  71  73  74  75  76  77  79  80  82  83
  84  85  86  88  89  92  93  94  96  97  98 100 102 104 105 106 108 111
 112 113 114 116 118 119 120 121 122 123 125 126 127 128 129 130 131 132
 133 134 135 136 137 138 139 140 141 142 143 145 146 147 148 149 150 151
 153 155 156 158 159 160 161 162 164 165 166 167 168 169 170 171 172 173
 174 176 177 178 179 180 181 182 183 184 185 188 192 193 194 195 196 197
 198 199 200 201 202 203 205 206 208 211 212 214 215 216 217 219 220 221
 222 223 225 227 228 230 231 232 233 234 235 236 237 239 241 242 243 244
 248 249 250 251 253 254 255 256 257 258 259 261 262 263 265 266 267 269
 270 271] 

Validation: [ 17  22  25  39  41  48  53  66  67  72  78  81  87  90  91  95  99 101
 103 107 109 110 115 117 124 144 152 154 157 163 175 186 187 189 190 191
 204 207 209 210 213 218 224 226 229 238 240 245 246 247 252 260 264 268]

Fold:4, Train set: 218, Test set:54

Train: [  0   1   2   5   6   7   8   9  10  13  15  16  17  18  19  20  21  22
  24  25  26  27  30  31  32  33  35  36  37  38  39  40  41  43  46  47
  48  49  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67
  68  69  70  72  73  74  75  76  77  78  80  81  83  86  87  89  90  91
  92  93  94  95  98  99 100 101 102 103 105 106 107 109 110 111 113 114
 115 116 117 119 120 121 122 123 124 125 126 127 129 130 131 132 136 137
 138 139 140 141 142 144 147 148 149 150 151 152 153 154 155 156 157 158
 159 160 161 162 163 164 166 167 169 170 171 172 173 175 176 177 178 179
 180 181 183 186 187 189 190 191 192 193 194 195 196 197 198 200 204 205
 206 207 208 209 210 211 212 213 214 216 217 218 219 221 223 224 226 227
 229 230 231 232 233 235 236 237 238 239 240 241 242 243 244 245 246 247
 248 249 250 251 252 253 254 256 258 259 260 262 264 265 266 267 268 269
 270 271] 

Validation: [  3   4  11  12  14  23  28  29  34  42  44  45  50  51  71  79  82  84
  85  88  96  97 104 108 112 118 128 133 134 135 143 145 146 165 168 174
 182 184 185 188 199 201 202 203 215 220 222 225 228 234 255 257 261 263]

Fold:5, Train set: 218, Test set:54

Train: [  0   1   2   3   4   5   6   7   9  10  11  12  14  17  19  20  21  22
  23  24  25  26  28  29  32  34  35  37  38  39  41  42  43  44  45  46
  47  48  49  50  51  52  53  55  56  58  59  60  61  63  64  66  67  68
  69  70  71  72  75  76  78  79  80  81  82  83  84  85  87  88  90  91
  95  96  97  98  99 101 102 103 104 105 106 107 108 109 110 111 112 113
 114 115 116 117 118 119 120 121 124 127 128 129 130 131 132 133 134 135
 136 138 142 143 144 145 146 147 148 149 152 154 155 157 159 160 161 162
 163 164 165 166 167 168 169 170 171 172 173 174 175 176 178 180 181 182
 183 184 185 186 187 188 189 190 191 192 193 194 195 196 198 199 201 202
 203 204 205 207 209 210 211 212 213 214 215 217 218 219 220 222 223 224
 225 226 227 228 229 230 232 233 234 235 236 237 238 240 241 242 244 245
 246 247 248 250 251 252 253 254 255 257 258 260 261 262 263 264 266 268
 270 271] 

Validation: [  8  13  15  16  18  27  30  31  33  36  40  54  57  62  65  73  74  77
  86  89  92  93  94 100 122 123 125 126 137 139 140 141 150 151 153 156
 158 177 179 197 200 206 208 216 221 231 239 243 249 256 259 265 267 269]

Linear Regression¶

For this set, each fold is scored with the MSE; I then take the average of the five scores and its square root to obtain the RMSE.

In [93]:
# importing the necessary libraries
from sklearn.linear_model import LinearRegression

# create the linear regression model
model_regression = LinearRegression()
In [94]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_lineal = cross_val_score(model_regression, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
lrcv_time = end - start

print(f'Training time: {lrcv_time}s')

print(f'\nScore on each fold {scores_lineal}')
print(f'\nAverage of scores {scores_lineal.mean()*-1}')
rmse_lineal = np.sqrt(-scores_lineal.mean())
print(f'\nrmse = {rmse_lineal}')
Training time: 0.05377316474914551s

Score on each fold [-7.9966318  -3.62970645 -4.5336746  -1.68881897 -0.49033276]

Average of scores 3.667832914409396

rmse = 1.915158717811502

cross_val_score() reports scores in ascending order (largest score is best). But MSE is naturally descending scores (the smallest score is best). Thus we need to use ‘neg_mean_squared_error’ to invert the sorting. This also results in the score to be negative even though the value can never be negative. "https://www.andrewgurung.com/2018/12/28/regression-model-evaluation-mean-squared-error/"

3.3.1.1. Common cases: predefined values For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error which return the negated value of the metric. "https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter" (Buitinick et al., 2011)
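The sign convention can be verified directly; a minimal sketch on hypothetical synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# hypothetical synthetic regression problem, just to show the sign convention
X_demo, y_demo = make_regression(n_samples=60, n_features=3, noise=1.0, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=5, scoring='neg_mean_squared_error')

assert (scores <= 0).all()            # negated MSE is never positive
rmse_demo = np.sqrt(-scores.mean())   # negate back, average, then take the square root
assert rmse_demo >= 0
```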

It seems that cross-validation is not helping in this case; however, we will have to wait until the data has been normalized to see how it behaves.

Decision Tree Regressor¶

This regressor still performs quite poorly, even worse than the linear regression with CV

In [95]:
# import the necessary libraries
from sklearn.tree import DecisionTreeRegressor

# create the model
model_tree = DecisionTreeRegressor(random_state=10)
In [96]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_tree = cross_val_score(model_tree, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
dtcv_time = end - start

print(f'Training time: {dtcv_time}s')

print(f'\nScore on each fold {scores_tree}')
print(f'\nAverage of scores {scores_tree.mean()*-1}')
rmse_decision_tree = np.sqrt(-scores_tree.mean())
print(f'\nrmse = {rmse_decision_tree}')
Training time: 0.052350521087646484s

Score on each fold [-88.10274436 -95.27783347 -48.00090481 -38.16288283 -44.28817509]

Average of scores 62.76650811339336

rmse = 7.922531673233838

LASSO Regressor (Least Absolute Shrinkage and Selection Operator)¶

This regressor is still weak: better than the decision tree with CV, but worse than the linear regression with CV

In [97]:
# import the necessary libraries
from sklearn.linear_model import Lasso

# creates the model
#lasso_regressor = Lasso(fit_intercept=False)
lasso_regressor = Lasso(beta) # beta is passed as alpha, the regularization strength (defined earlier)
In [98]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_lasso = cross_val_score(lasso_regressor, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
lassocv_time = end - start

print(f'Training time: {lassocv_time}s')

print(f'\nScore on each fold {scores_lasso}')
print(f'\nAverage of scores {scores_lasso.mean()*-1}')
rmse_lasso = np.sqrt(-scores_lasso.mean())
print(f'\nrmse = {rmse_lasso}')
Training time: 0.046860694885253906s

Score on each fold [-22.85179754  -1.80901588 -14.7165008   -1.38908212  -7.68926488]

Average of scores 9.691132244304956

rmse = 3.113058342579682

SVM Support Vector Machine Regressor¶

This regressor has the worst RMSE of all, but as we know, SVM is very sensitive to normalization

In [99]:
# import the necessary libraries
from sklearn.svm import SVR

# creates the model
svm_regressor = SVR(kernel = 'rbf') # radial basis function (Gaussian) kernel
In [100]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_svm = cross_val_score(svm_regressor, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
svmcv_time = end - start

print(f'Training time: {svmcv_time}s')

print(f'\nScore on each fold {scores_svm}')
print(f'\nAverage of scores {scores_svm.mean()*-1}')
rmse_svm = np.sqrt(-scores_svm.mean())
print(f'\nrmse = {rmse_svm}')
Training time: 0.08917093276977539s

Score on each fold [-3502.41479023    -8.09038181 -2478.37239504   -78.36867538
  -926.03160841]

Average of scores 1398.6555701758805

rmse = 37.39860385329752

Regressions Normalized¶

Methodology¶

In this section, the algorithms will be run Normalized using StandardScaler()

methodology4.jpg

Split the data frame into features (X) and target (y)¶

In [101]:
# creates X and y, with All the columns for X and for y = OIC_price
X = test.iloc[:, [1,2,3,5,7,8,9,10,11,12]].values  # All the variables but OIC_price, values: to get them as numpy array
y = test.iloc[:, 0].values # OIC_price, values to get them as numpy array

Normalize with StandardScaler()¶

In [102]:
# applies standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

# normalizing with the standard-deviation (z-score) method
X_scaler = StandardScaler()
y_scaler = StandardScaler()
    
# fits and transforms the data
X = X_scaler.fit_transform(X)
y = y.reshape(-1,1) # reshape because the array has to be a 2D array or (1 column, many rows..)
y = y_scaler.fit_transform(y)   

# converting the numpy array back into dataframe
X = pd.DataFrame(X, columns = ['Colombia_ny', 'Colombia_europe', 'Colombia_average', 'Other_europe',
                               'Brazil_ny', 'Brazil_europe', 'Brazil_average', 'Robustas_ny', 'Robustas_europe',
                               'Robustas_average'])

y = np.squeeze(y) # turns the from 2 dimensions to 1 dimension
y = pd.Series(y)

Cross Validation¶

In this exercise the number of splits k = 5, with Shuffle = True

In [103]:
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold

 # number of splits
k = 5

# cross-validation method
cv = KFold(n_splits=k, random_state=10, shuffle=True)

cont = 1
# splits data into training and test set.
for train_index, test_index in cv.split(X, y):
    print(f'\nFold:{cont}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    print("\nTrain:", train_index, "\n\nValidation:",test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    cont = cont + 1
Fold:1, Train set: 217, Test set:55

Train: [  0   1   2   3   4   5   6   7   8   9  11  12  13  14  15  16  17  18
  22  23  25  27  28  29  30  31  32  33  34  36  37  39  40  41  42  44
  45  46  48  50  51  52  53  54  55  57  58  62  63  65  66  67  68  69
  70  71  72  73  74  75  76  77  78  79  81  82  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  99 100 101 102 103 104 106 107 108 109
 110 112 115 116 117 118 119 120 122 123 124 125 126 127 128 132 133 134
 135 136 137 138 139 140 141 143 144 145 146 147 148 150 151 152 153 154
 156 157 158 160 161 162 163 164 165 166 168 169 172 174 175 177 178 179
 182 184 185 186 187 188 189 190 191 195 196 197 198 199 200 201 202 203
 204 206 207 208 209 210 212 213 215 216 218 219 220 221 222 223 224 225
 226 227 228 229 231 232 234 236 237 238 239 240 241 243 244 245 246 247
 248 249 251 252 253 254 255 256 257 259 260 261 262 263 264 265 267 268
 269] 

Validation: [ 10  19  20  21  24  26  35  38  43  47  49  56  59  60  61  64  80  83
  98 105 111 113 114 121 129 130 131 142 149 155 159 167 170 171 173 176
 180 181 183 192 193 194 205 211 214 217 230 233 235 242 250 258 266 270
 271]

Fold:2, Train set: 217, Test set:55

Train: [  3   4   8  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24
  25  26  27  28  29  30  31  33  34  35  36  38  39  40  41  42  43  44
  45  47  48  49  50  51  53  54  56  57  59  60  61  62  64  65  66  67
  71  72  73  74  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 103 104 105 107 108 109 110
 111 112 113 114 115 117 118 121 122 123 124 125 126 128 129 130 131 133
 134 135 137 139 140 141 142 143 144 145 146 149 150 151 152 153 154 155
 156 157 158 159 163 165 167 168 170 171 173 174 175 176 177 179 180 181
 182 183 184 185 186 187 188 189 190 191 192 193 194 197 199 200 201 202
 203 204 205 206 207 208 209 210 211 213 214 215 216 217 218 220 221 222
 224 225 226 228 229 230 231 233 234 235 238 239 240 242 243 245 246 247
 249 250 252 255 256 257 258 259 260 261 263 264 265 266 267 268 269 270
 271] 

Validation: [  0   1   2   5   6   7   9  32  37  46  52  55  58  63  68  69  70  75
  76 102 106 116 119 120 127 132 136 138 147 148 160 161 162 164 166 169
 172 178 195 196 198 212 219 223 227 232 236 237 241 244 248 251 253 254
 262]

Fold:3, Train set: 218, Test set:54

Train: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  18
  19  20  21  23  24  26  27  28  29  30  31  32  33  34  35  36  37  38
  40  42  43  44  45  46  47  49  50  51  52  54  55  56  57  58  59  60
  61  62  63  64  65  68  69  70  71  73  74  75  76  77  79  80  82  83
  84  85  86  88  89  92  93  94  96  97  98 100 102 104 105 106 108 111
 112 113 114 116 118 119 120 121 122 123 125 126 127 128 129 130 131 132
 133 134 135 136 137 138 139 140 141 142 143 145 146 147 148 149 150 151
 153 155 156 158 159 160 161 162 164 165 166 167 168 169 170 171 172 173
 174 176 177 178 179 180 181 182 183 184 185 188 192 193 194 195 196 197
 198 199 200 201 202 203 205 206 208 211 212 214 215 216 217 219 220 221
 222 223 225 227 228 230 231 232 233 234 235 236 237 239 241 242 243 244
 248 249 250 251 253 254 255 256 257 258 259 261 262 263 265 266 267 269
 270 271] 

Validation: [ 17  22  25  39  41  48  53  66  67  72  78  81  87  90  91  95  99 101
 103 107 109 110 115 117 124 144 152 154 157 163 175 186 187 189 190 191
 204 207 209 210 213 218 224 226 229 238 240 245 246 247 252 260 264 268]

Fold:4, Train set: 218, Test set:54

Train: [  0   1   2   5   6   7   8   9  10  13  15  16  17  18  19  20  21  22
  24  25  26  27  30  31  32  33  35  36  37  38  39  40  41  43  46  47
  48  49  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67
  68  69  70  72  73  74  75  76  77  78  80  81  83  86  87  89  90  91
  92  93  94  95  98  99 100 101 102 103 105 106 107 109 110 111 113 114
 115 116 117 119 120 121 122 123 124 125 126 127 129 130 131 132 136 137
 138 139 140 141 142 144 147 148 149 150 151 152 153 154 155 156 157 158
 159 160 161 162 163 164 166 167 169 170 171 172 173 175 176 177 178 179
 180 181 183 186 187 189 190 191 192 193 194 195 196 197 198 200 204 205
 206 207 208 209 210 211 212 213 214 216 217 218 219 221 223 224 226 227
 229 230 231 232 233 235 236 237 238 239 240 241 242 243 244 245 246 247
 248 249 250 251 252 253 254 256 258 259 260 262 264 265 266 267 268 269
 270 271] 

Validation: [  3   4  11  12  14  23  28  29  34  42  44  45  50  51  71  79  82  84
  85  88  96  97 104 108 112 118 128 133 134 135 143 145 146 165 168 174
 182 184 185 188 199 201 202 203 215 220 222 225 228 234 255 257 261 263]

Fold:5, Train set: 218, Test set:54

Train: [  0   1   2   3   4   5   6   7   9  10  11  12  14  17  19  20  21  22
  23  24  25  26  28  29  32  34  35  37  38  39  41  42  43  44  45  46
  47  48  49  50  51  52  53  55  56  58  59  60  61  63  64  66  67  68
  69  70  71  72  75  76  78  79  80  81  82  83  84  85  87  88  90  91
  95  96  97  98  99 101 102 103 104 105 106 107 108 109 110 111 112 113
 114 115 116 117 118 119 120 121 124 127 128 129 130 131 132 133 134 135
 136 138 142 143 144 145 146 147 148 149 152 154 155 157 159 160 161 162
 163 164 165 166 167 168 169 170 171 172 173 174 175 176 178 180 181 182
 183 184 185 186 187 188 189 190 191 192 193 194 195 196 198 199 201 202
 203 204 205 207 209 210 211 212 213 214 215 217 218 219 220 222 223 224
 225 226 227 228 229 230 232 233 234 235 236 237 238 240 241 242 244 245
 246 247 248 250 251 252 253 254 255 257 258 260 261 262 263 264 266 268
 270 271] 

Validation: [  8  13  15  16  18  27  30  31  33  36  40  54  57  62  65  73  74  77
  86  89  92  93  94 100 122 123 125 126 137 139 140 141 150 151 153 156
 158 177 179 197 200 206 208 216 221 231 239 243 249 256 259 265 267 269]

Linear Regression Normalized¶

Here we have a much better RMSE, tending towards zero

In [104]:
# importing the necessary libraries
from sklearn.linear_model import LinearRegression

# create the linear regression model
model_regression_norm = LinearRegression()
In [105]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_lineal_norm = cross_val_score(model_regression_norm, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
lrncv_time = end - start

print(f'Training time: {lrncv_time}s')

print(f'\nScore on each fold {scores_lineal_norm}')
print(f'\nAverage of scores {scores_lineal_norm.mean()*-1}')
rmse_norm_lineal = np.sqrt(-scores_lineal_norm.mean())
print(f'\nrmse = {rmse_norm_lineal}')
Training time: 0.053802490234375s

Score on each fold [-0.00420517 -0.00190875 -0.00238411 -0.0008881  -0.00025785]

Average of scores 0.001928794662027267

rmse = 0.043918044833840986
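As a quick check, the overall RMSE can be recovered directly from the per-fold scores printed above (`cross_val_score` with `scoring='neg_mean_squared_error'` returns negative MSE values, so they must be negated before taking the square root):

```python
import numpy as np

# per-fold scores printed above (negative MSE values)
scores = np.array([-0.00420517, -0.00190875, -0.00238411, -0.0008881, -0.00025785])

# negate before the square root: RMSE per fold, and overall RMSE from the mean MSE
rmse_per_fold = np.sqrt(-scores)
rmse_overall = np.sqrt(-scores.mean())
print(rmse_per_fold)
print(rmse_overall)  # ~0.0439, matching the value printed above
```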

Decision Tree Regressor Normalized¶

This algorithm improves considerably with normalization, since the previous cross-validated Decision Tree had an RMSE of about 7.9

In [106]:
# import the necessary libraries
from sklearn.tree import DecisionTreeRegressor

# create the model
model_tree_norm = DecisionTreeRegressor(random_state=10)
In [107]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_tree_norm = cross_val_score(model_tree_norm, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
dtncv_time = end - start

print(f'\nTraining time: {dtncv_time}s')

print(f'Score on each fold {scores_tree_norm}')
print(f'\nAverage of scores {scores_tree_norm.mean()*-1}')
rmse_norm_decision_tree = np.sqrt(-scores_tree_norm.mean())
print(f'\nrmse = {rmse_norm_decision_tree}')
Training time: 0.05807852745056152s
Score on each fold [-0.04354205 -0.03958679 -0.0266162  -0.01776122 -0.02346921]

Average of scores 0.03019509650479573

rmse = 0.17376736317500974

LASSO Regressor (Least Absolute Shrinkage and Selection Operator)¶

The algorithm performs really well, since the RMSE is tending towards zero

In [108]:
# import the necessary libraries
from sklearn.linear_model import Lasso

# create the model, reusing the alpha value defined earlier in the notebook
#lasso_regressor = Lasso(fit_intercept=False)
lasso_regressor_norm = Lasso(alpha=alpha)
In [109]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_lasso_norm = cross_val_score(lasso_regressor_norm, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
lassoncv_time = end - start

print(f'Training time: {lassoncv_time}s')

print(f'\nScore on each fold {scores_lasso_norm}')
print(f'\nAverage of scores {scores_lasso_norm.mean()*-1}')
rmse_norm_lasso = np.sqrt(-scores_lasso_norm.mean())
print(f'\nrmse = {rmse_norm_lasso}')
Training time: 0.04758310317993164s

Score on each fold [-0.23701501 -0.00497385 -0.08474799 -0.00344022 -0.01105593]

Average of scores 0.06824659984517778

rmse = 0.26124050192337667
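The LASSO scores depend strongly on the chosen `alpha`. As a hedged aside (not part of the notebook's own workflow), scikit-learn's `LassoCV` can select `alpha` by cross-validation; the sketch below uses synthetic stand-in data rather than the notebook's `X` and `y`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# illustrative only: synthetic data standing in for the notebook's X and y
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

# LassoCV searches a grid of alpha values internally via cross-validation,
# instead of fixing alpha by hand as done above
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
print(lasso_cv.alpha_)
```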

SVM (Support Vector Machine) Regressor Normalized¶

This regressor had an RMSE of 37.39 and has improved all the way down to 0.26, which confirms that normalizing the data yields much better results for this model

In [110]:
# import the necessary libraries
from sklearn.svm import SVR

# creates the model
svm_regressor_norm = SVR(kernel = 'rbf') # radial basis function (Gaussian)
In [111]:
# Start measuring the time it takes to train the model
start = time.time()

# cross validation to evaluate the model
scores_svm_norm = cross_val_score(svm_regressor_norm, X, y, cv = k, scoring='neg_mean_squared_error')

# finish measuring the time it takes to train the model
end = time.time()

# total training time
svmncv_time = end - start

print(f'Training time: {svmncv_time} s')

print(f'\nScore on each fold {scores_svm_norm}')
print(f'\nAverage of scores {scores_svm_norm.mean()*-1}')
rmse_norm_svm = np.sqrt(-scores_svm_norm.mean())
print(f'\nrmse = {rmse_norm_svm}')
Training time: 0.06256628036499023 s

Score on each fold [-0.2404445  -0.0063709  -0.07527134 -0.00384803 -0.00245652]

Average of scores 0.06567825455570615

rmse = 0.256277690319907
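Since normalization matters so much for SVR, one refinement worth noting (a sketch under assumptions, not the notebook's approach, and again on synthetic stand-in data) is to bundle the scaler and the model in a pipeline, so the scaler is re-fit on each training fold and no information from the validation fold leaks into the scaling:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# illustrative only: synthetic data; in the notebook X was normalized beforehand
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

# the pipeline re-fits MinMaxScaler on each training fold inside cross-validation
pipe = make_pipeline(MinMaxScaler(), SVR(kernel='rbf'))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error')
print(scores)
```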

Efficiency or Timing of the algorithms¶

To measure the time each model takes to train or test, a data frame was created with the timing values recorded for each algorithm.

In [112]:
# Efficiency, by measuring the time each model took to be trained or tested
times =  pd.DataFrame()
times['names'] = ['Linear_regression', 'Decision_tree', 'LASSO_regression', 'SVM_regression',
                 'Linear_regression_norm', 'Decision_tree_norm', 'LASSO_regression_norm', 'SVM_regression_norm', 
                  'Linear_regression_cv', 'Decision_tree_cv', 'LASSO_regression_cv', 'SVM_regression_cv', 
                 'Linear_regression_norm_cv', 'Decision_tree_norm_cv', 'LASSO_regression_norm_cv', 'SVM_regression_norm_cv']

times['times'] = [lr_time, dt_time, lasso_time, svm_time,
                lrn_time, dtn_time, lasson_time, svmn_time,
                lrcv_time, dtcv_time, lassocv_time, svmcv_time,
                lrncv_time, dtncv_time, lassoncv_time, svmncv_time,]

times.set_index('names', inplace = True) # sets the index
times
Out[112]:
times
names
Linear_regression 0.019656
Decision_tree 0.007150
LASSO_regression 0.004242
SVM_regression 0.012968
Linear_regression_norm 0.012022
Decision_tree_norm 0.007905
LASSO_regression_norm 0.000000
SVM_regression_norm 0.005477
Linear_regression_cv 0.053773
Decision_tree_cv 0.052351
LASSO_regression_cv 0.046861
SVM_regression_cv 0.089171
Linear_regression_norm_cv 0.053802
Decision_tree_norm_cv 0.058079
LASSO_regression_norm_cv 0.047583
SVM_regression_norm_cv 0.062566

The slowest algorithm is SVM Regression with cross-validation; the same algorithm on normalized data, using the same cross-validation technique, takes much less time

The fastest algorithm is LASSO Regression Normalized on the train/test split; its recorded time of 0.0 s is an artifact of the limited resolution of time.time rather than a truly instantaneous fit

In [113]:
# dropping as they have very high numbers
#metrics.drop(['SVM_Regression', 'Regression_Tree'], inplace = True)
times.plot(kind='bar',figsize=(25,15))
plt.show()

Note the difference between cross-validation and the train/test split: cross-validation increases the running time of the algorithms severalfold, since the model is fit once per fold rather than once overall

The slowest algorithm

In [114]:
# checking the maximum number in the column times, or slowest algorithm
times[times['times'] == times['times'].max()]
Out[114]:
times
names
SVM_regression_cv 0.089171

The fastest algorithm

In [115]:
# checking the minimum number in the column times, or fastest algorithm
times[times['times'] == times['times'].min()]
Out[115]:
times
names
LASSO_regression_norm 0.0
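The 0.0 s reading suggests that `time.time` is too coarse for measurements this short. A sketch of an alternative (`timed` is a hypothetical helper, not from the notebook) uses `time.perf_counter`, which offers much finer resolution:

```python
import time

# hypothetical helper: perf_counter has far finer resolution than time.time,
# avoiding the 0.0 s reading recorded for the fastest model
def timed(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

result, elapsed = timed(sum, range(1000))
print(f'{elapsed:.9f}s')
```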

Effectiveness in Cross Validation¶

In [116]:
metrics =  pd.DataFrame()
metrics['Algorithm'] = ['Linear_regression', 'Regression_Tree', 'Regression_Lasso', 'SVM_Regression',
                 'Linear_Regression_Norm', 'Regression_Tree_Norm', 'Regression_Lasso_Norm', 'SVM_Regression_Norm']


metrics['MSE'] = [scores_lineal.mean()*-1, scores_tree.mean()*-1, scores_lasso.mean()*-1, scores_svm.mean()*-1,
                          scores_lineal_norm.mean()*-1, scores_tree_norm.mean()*-1, scores_lasso_norm.mean()*-1, 
                          scores_svm_norm.mean()*-1]

metrics['RMSE'] = [rmse_lineal, rmse_decision_tree, rmse_lasso, rmse_svm, rmse_norm_lineal, rmse_norm_decision_tree,
                rmse_norm_lasso, rmse_norm_svm]

metrics.set_index('Algorithm', inplace = True) # sets the index

metrics
Out[116]:
MSE RMSE
Algorithm
Linear_regression 3.667833 1.915159
Regression_Tree 62.766508 7.922532
Regression_Lasso 9.691132 3.113058
SVM_Regression 1398.655570 37.398604
Linear_Regression_Norm 0.001929 0.043918
Regression_Tree_Norm 0.030195 0.173767
Regression_Lasso_Norm 0.068247 0.261241
SVM_Regression_Norm 0.065678 0.256278

It can be seen that the most effective algorithm under cross-validation is Linear Regression Normalized, followed by Regression Tree Normalized. Note that SVM Regression and Regression Tree perform very poorly when not normalized, but Regression Tree becomes really good once it has been normalized.

Table measuring MSE and RMSE but dropping SVM_Regression and Regression_Tree

In [117]:
metrics_droped = metrics.copy()

# dropping as they have very high numbers
metrics_droped.drop(['SVM_Regression', 'Regression_Tree'],inplace = True)
metrics_droped.plot(kind='bar',figsize=(25,15))
plt.show()

Effectiveness in Train set and Test set¶

In [118]:
# transposing the data frame
errors = errors.T

errors_droped = errors.copy()

# dropping columns r2 and MAE
errors_droped = errors_droped.drop(columns = ['r2', 'MAE'])
errors_droped
Out[118]:
Metrics MSE RMSE
Linear_regression 1.539164 1.240631
Regression_Tree 22.463621 4.739580
Regression_Lasso 3.646354 1.909543
SVM_Regression 456.083021 21.356100
Linear_Regression_Norm 0.000809 0.028450
Regression_Tree_Norm 0.013131 0.114592
Regression_Lasso_Norm 0.012066 0.109843
SVM_Regression_Norm 0.005159 0.071826

It can be seen that the best algorithm on the train/test split is Linear Regression Normalized, followed by SVM Regression Normalized. Note that SVM Regression and Regression Tree had to be dropped from the plot because their errors are very high, yet both perform really well once normalized, which shows why it is good practice to normalize the data in regression problems.

Table measuring MSE and RMSE but dropping SVM_Regression and Regression_Tree

In [119]:
# dropping SVM_Regression and Regression_Tree as they have very high numbers
errors_droped.drop(['SVM_Regression', 'Regression_Tree'], inplace = True)
errors_droped.plot(kind='bar',figsize=(25,15))
plt.show()

When comparing both results side by side, the following table is obtained

In [120]:
# comparing both data frames
comparation=errors_droped.compare(metrics_droped)
comparation
Out[120]:
Metrics MSE RMSE
self other self other
Linear_regression 1.539164 3.667833 1.240631 1.915159
Regression_Lasso 3.646354 9.691132 1.909543 3.113058
Linear_Regression_Norm 0.000809 0.001929 0.028450 0.043918
Regression_Tree_Norm 0.013131 0.030195 0.114592 0.173767
Regression_Lasso_Norm 0.012066 0.068247 0.109843 0.261241
SVM_Regression_Norm 0.005159 0.065678 0.071826 0.256278

It can be seen from the graph that the best algorithm is Linear Regression Normalized, as it has the minimum values in both exercises. Note that SVM_Regression and Regression_Tree have been dropped from the comparison because their high values would make the other values disappear from the graph.

In [121]:
# dropping 'SVM_Regression', 'Regression_Tree' and 'Regression_Lasso' as they are quite high
#comparation.drop(['SVM_Regression', 'Regression_Tree','Regression_Lasso'], inplace = True)
comparation.plot(kind='bar',figsize=(25,15))
plt.show()

Linear Regression Normalized has the lowest values overall. SVM_Regression and Regression_Tree were dropped earlier because their values are so high that they would dwarf the other values on the graph

Now, different algorithms excel in different categories, so which should be chosen? In this case I will standardize everything onto a common scale (z-score) so the metrics can be compared, take the average, and from there select the minimum value
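The selection logic can be sketched in miniature. This is illustrative only, using three rows of values from the tables below (Linear_regression, Regression_Tree, Linear_Regression_Norm):

```python
import numpy as np
from scipy import stats

# three example rows: RMSE and timing on very different scales
rmse = np.array([1.915159, 7.922532, 0.043918])
times = np.array([0.053773, 0.052351, 0.053802])

# z-score puts RMSE and time on a common scale before averaging per row
combined = (stats.zscore(rmse) + stats.zscore(times)) / 2
best = int(np.argmin(combined))
print(best)  # → 2, i.e. Linear_Regression_Norm
```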

In [122]:
metrics['times'] = [lrcv_time, dtcv_time, lassocv_time, svmcv_time,lrncv_time, dtncv_time, lassoncv_time, svmncv_time,]
metrics.drop('MSE', axis = 1, inplace = True)
metrics
Out[122]:
RMSE times
Algorithm
Linear_regression 1.915159 0.053773
Regression_Tree 7.922532 0.052351
Regression_Lasso 3.113058 0.046861
SVM_Regression 37.398604 0.089171
Linear_Regression_Norm 0.043918 0.053802
Regression_Tree_Norm 0.173767 0.058079
Regression_Lasso_Norm 0.261241 0.047583
SVM_Regression_Norm 0.256278 0.062566

Dataframe with Z-Score of each column

In [123]:
# adding Neural Network to the dataframe
metrics.loc['Neural_Network'] = [nn_rmse, nn_time]
In [124]:
metrics['RMSE_zscore'] = stats.zscore(metrics['RMSE'])
metrics['times_zscore'] = stats.zscore(metrics['times'])
metrics
Out[124]:
RMSE times RMSE_zscore times_zscore
Algorithm
Linear_regression 1.915159 0.053773 -0.328082 -0.354755
Regression_Tree 7.922532 0.052351 0.195528 -0.355158
Regression_Lasso 3.113058 0.046861 -0.223672 -0.356714
SVM_Regression 37.398604 0.089171 2.764701 -0.344727
Linear_Regression_Norm 0.043918 0.053802 -0.491182 -0.354747
Regression_Tree_Norm 0.173767 0.058079 -0.479864 -0.353536
Regression_Lasso_Norm 0.261241 0.047583 -0.472240 -0.356509
SVM_Regression_Norm 0.256278 0.062566 -0.472673 -0.352264
Neural_Network 0.028628 11.289489 -0.492515 2.828411

Dataframe with the mean of both Z-scores

In [125]:
# column mean is the average of the z-scores of each column (in a per row basis)
metrics['mean'] = (metrics['RMSE_zscore'] + metrics['times_zscore'])/2
metrics
Out[125]:
RMSE times RMSE_zscore times_zscore mean
Algorithm
Linear_regression 1.915159 0.053773 -0.328082 -0.354755 -0.341419
Regression_Tree 7.922532 0.052351 0.195528 -0.355158 -0.079815
Regression_Lasso 3.113058 0.046861 -0.223672 -0.356714 -0.290193
SVM_Regression 37.398604 0.089171 2.764701 -0.344727 1.209987
Linear_Regression_Norm 0.043918 0.053802 -0.491182 -0.354747 -0.422965
Regression_Tree_Norm 0.173767 0.058079 -0.479864 -0.353536 -0.416700
Regression_Lasso_Norm 0.261241 0.047583 -0.472240 -0.356509 -0.414375
SVM_Regression_Norm 0.256278 0.062566 -0.472673 -0.352264 -0.412468
Neural_Network 0.028628 11.289489 -0.492515 2.828411 1.167948
In [126]:
# takes the minimum of the column mean
metrics[metrics['mean'] == metrics['mean'].min()]
Out[126]:
RMSE times RMSE_zscore times_zscore mean
Algorithm
Linear_Regression_Norm 0.043918 0.053802 -0.491182 -0.354747 -0.422965
In [127]:
# takes the minimum of the RMSE column, since timing changes on every run but RMSE is constant
metrics[metrics['RMSE'] == metrics['RMSE'].min()]
Out[127]:
RMSE times RMSE_zscore times_zscore mean
Algorithm
Neural_Network 0.028628 11.289489 -0.492515 2.828411 1.167948

Once the values have been scaled and the mean calculated, the program selects the minimum mean value; this is the best-performing algorithm in terms of both time and RMSE

The execution time of each algorithm changes on every run, and the times are too short to worry about; in this particular case a microsecond or even a few seconds makes no difference, so the best algorithm to use is Linear Regression Normalized, as it has the best metric independent of time.

However, note that the Neural Network has a better RMSE than Linear Regression Normalized, with a relative difference of 53.4% between the two
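That 53.4% figure comes from the relative difference between the two lowest RMSE values in the table above:

```python
# the two lowest RMSE values from the metrics table
rmse_lr_norm = 0.043918   # Linear_Regression_Norm
rmse_nn = 0.028628        # Neural_Network

# relative difference, taking the Neural Network's RMSE as the baseline
pct_change = (rmse_lr_norm - rmse_nn) / rmse_nn * 100
print(f'{pct_change:.1f}%')  # → 53.4%
```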

In [ ]: